Abstract. Ratios of normalizing constants for two distributions are needed in both Bayesian statistics, 
where they are used to compare models, and in statistical physics, where they correspond to differences 
in free energy. Two approaches have long been used to estimate ratios of normalizing constants. 
The 'simple importance sampling' (SIS) or 'free energy perturbation' method uses a sample drawn 
from just one of the two distributions. The 'bridge sampling' or 'acceptance ratio' estimate can be 
viewed as the ratio of two SIS estimates involving a bridge distribution. For both methods, difficult 
problems must be handled by introducing a sequence of intermediate distributions linking the two 
distributions of interest, with the final ratio of normalizing constants being estimated by the product 
of estimates of ratios for adjacent distributions in this sequence. Recently, work by Jarzynski, and 
independently by Neal, has shown how one can view such a product of estimates, each based on 
simple importance sampling using a single point, as an SIS estimate on an extended state space. This 
'Annealed Importance Sampling' (AIS) method produces an exactly unbiased estimate for the ratio 
of normalizing constants even when the Markov transitions used do not reach equilibrium. In this 
paper, I show how a corresponding 'Linked Importance Sampling' (LIS) method can be constructed 
in which the estimates for individual ratios are similar to bridge sampling estimates. As a further 
elaboration, bridge sampling rather than simple importance sampling can be employed at the top 
level for both AIS and LIS, which sometimes produces further improvement. I show empirically that 
for some problems, LIS estimates are much more accurate than AIS estimates found using the same 
computation time, although for other problems the two methods have similar performance. Like AIS, 
LIS can also produce estimates for expectations, even when the distribution contains multiple isolated 
modes. AIS is related to the 'tempered transition' method for handling isolated modes, and to a 
method for 'dragging' fast variables. Linked sampling methods similar to LIS can be constructed 
that are analogous to tempered transitions and to this method for dragging fast variables, which may 
sometimes work better than those analogous to AIS. 



1 Introduction 



Consider two distributions on the same space, with probability mass or density functions $\pi_0(x) = p_0(x)/Z_0$ 
and $\pi_1(x) = p_1(x)/Z_1$. Suppose that we are not able to directly compute $\pi_0$ and $\pi_1$, but only 
$p_0$ and $p_1$, since we do not know the normalizing constants, $Z_0$ and $Z_1$. We wish to find a Monte Carlo 
estimate for the ratio of these normalizing constants, $Z_1/Z_0$, which we sometimes denote by $r$, using 
samples of values drawn (at least approximately) from $\pi_0$ and from $\pi_1$. Sometimes, we may know $Z_0$, 
in which case we can arrange for it to be one, so that estimation of this ratio will give the numerical 
value of $Z_1$. Other times, we will be able to obtain only the ratio of normalizing constants, but this 
may be sufficient for our purposes. 

In statistical physics, x represents the state of some physical system, and the distributions are 
typically 'canonical' distributions having the following form (for j = 0, 1): 

$$p_j(x) = \exp(-\beta_j\, U(x, \lambda_j)) \tag{1}$$

where $U(x, \lambda_j)$ is an 'energy' function, which may depend on the parameter $\lambda_j$, and $\beta_j$ is the inverse 
temperature of system $j$. Many interesting properties of the systems are related to the 'free energy', 
defined as $-\log(Z_j)/\beta_j$. Often, only the difference in free energy between systems 0 and 1 is relevant, 
and this is determined by the ratio $Z_1/Z_0$. 

In Bayesian statistics, x comprises the parameters and latent variables for some statistical model, 
$\pi_0$ is the prior distribution for these quantities (for which the normalizing constant is usually known), 
and $\pi_1$ is the posterior distribution given the observed data. We can compute $p_1(x)$ as the product 
of the prior density for $x$ and the probability of the data given $x$, but the normalizing constant, $Z_1$, 
is difficult to compute. We can interpret $Z_1$ as the 'marginal likelihood', that is, the probability of the 
observed data under this model, integrating over possible values of the model's parameters and latent 
variables. The marginal likelihood for a model indicates how well it is supported by the data. 

Although I will use simple distributions as illustrations in this paper, in real applications, x is 
usually high dimensional, and at least one of $\pi_0$ and $\pi_1$ is usually quite complex. Accordingly, 
sampling from these distributions generally requires use of Markov chain methods, such as the venerable 
Metropolis algorithm (Metropolis et al. 1953). See (Neal 1993) for a review of Markov chain sampling 
methods. Sometimes, however, $\pi_0$ will be relatively simple, and independent points drawn from it can 
be generated efficiently, as would often be the case with the prior distribution for a Bayesian model, 
or for a physical system at infinite temperature ($\beta_0 = 0$). 

Many methods for estimating ratios of normalizing constants from Monte Carlo data have been 
investigated in the physics literature (for a review, see (Neal 1993, Section 6.2)), and later 
rediscovered in the statistics literature (Gelman and Meng 1998). A logical method to start with is 'simple 
importance sampling' (SIS), also called 'free energy perturbation', based on the following identity, 
which can easily be proved on the assumption that no region having zero probability under $\pi_0$ has 
non-zero probability under $\pi_1$: 

$$r = \frac{Z_1}{Z_0} = E_{\pi_0}\!\left[\frac{p_1(X)}{p_0(X)}\right] \approx \frac{1}{N}\sum_{i=1}^{N}\frac{p_1(x^{(i)})}{p_0(x^{(i)})} = \hat r_{\mathrm{SIS}} \tag{2}$$

In the above equation, $E_{\pi_0}$ denotes an expectation with respect to the distribution $\pi_0$, which is 
estimated by a Monte Carlo average over points $x^{(1)}, \ldots, x^{(N)}$ drawn from $\pi_0$ (either independently, 
or using a Markov chain sampler). Here and later, $\hat r_M$ will denote an estimate of $r = Z_1/Z_0$, found by 
method $M$. If this estimate is an average of unbiased estimates based on a number of samples, these 
individual estimates will be denoted by $\hat r_M^{(i)}$. 
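
The simple importance sampling estimate of equation (2) can be sketched in a few lines. This is only an illustration: the function name `sis_estimate` and the pair of Gaussian test densities are assumptions chosen so that the true ratio ($Z_1/Z_0 = 0.5$) is known, and are not taken from the experiments in this paper.

```python
import math
import random

def sis_estimate(p0, p1, sample0):
    """SIS estimate of r = Z1/Z0: the average of p1(x)/p0(x) over points drawn from pi_0."""
    return sum(p1(x) / p0(x) for x in sample0) / len(sample0)

# Assumed test case: pi_0 = N(0, 1) and pi_1 = N(0, 0.5^2), both left unnormalized.
def p0(x):
    return math.exp(-0.5 * x * x)             # Z0 = sqrt(2*pi)

def p1(x):
    return math.exp(-0.5 * (x / 0.5) ** 2)    # Z1 = 0.5 * sqrt(2*pi)

rng = random.Random(1)
sample0 = [rng.gauss(0.0, 1.0) for _ in range(20000)]   # independent draws from pi_0
r_sis = sis_estimate(p0, p1, sample0)         # should be close to Z1/Z0 = 0.5
```

Here $\pi_1$ is narrower than $\pi_0$, so every region with appreciable probability under $\pi_1$ is well covered by the sample from $\pi_0$; this is the favourable case for SIS.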

The simple importance sampling estimate, $\hat r_{\mathrm{SIS}}$, will be poor if $\pi_0$ and $\pi_1$ are not close enough, 
in particular, if any region with non-negligible probability under $\pi_1$ has very small probability under 
$\pi_0$. Such a region would have an important effect on the value of $r$, but very little information about 
it would be contained in the sample from $\pi_0$. In such a situation, it may be possible to obtain a good 
estimate by introducing intermediate distributions. Parameterizing these distributions in some way 
using $\eta$, we can define a sequence of distributions, $\pi_{\eta_0}, \ldots, \pi_{\eta_n}$, with $\eta_0 = 0$ and $\eta_n = 1$, so that the first 
and last distributions in the sequence are $\pi_0$ and $\pi_1$, with the intermediate distributions interpolating 
between them. We can then write 

$$\frac{Z_1}{Z_0} = \frac{Z_{\eta_n}}{Z_{\eta_0}} = \prod_{j=0}^{n-1} \frac{Z_{\eta_{j+1}}}{Z_{\eta_j}} \tag{3}$$

Provided that $\pi_{\eta_{j+1}}$ and $\pi_{\eta_j}$ are close enough, we can estimate each of the factors $Z_{\eta_{j+1}}/Z_{\eta_j}$ using 
simple importance sampling, and from these estimates obtain an estimate for $Z_1/Z_0$. 

We can obtain good estimates in a wider range of situations, or using fewer intermediate distributions 
(sometimes none), by applying a technique introduced by Bennett (1976), who called it the 'acceptance 
ratio' method. This method was later rediscovered by Meng and Wong (1996), who called it 'bridge 
sampling'. Lu, Singh, and Kofke (2003) provide a recent review and assessment. One way of viewing 
this method is that it replaces the simple importance sampling estimate for $Z_1/Z_0$ by a ratio of 
estimates for $Z_*/Z_0$ and $Z_*/Z_1$, where $Z_*$ is the normalizing constant for a 'bridge distribution', 
$\pi_*(x) = p_*(x)/Z_*$, which is chosen so that it is overlapped by both $\pi_0$ and $\pi_1$. Using simple importance 
sampling estimates for $Z_*/Z_0$ and $Z_*/Z_1$, we can obtain the estimate 

$$r = \frac{Z_1}{Z_0} = \frac{Z_*/Z_0}{Z_*/Z_1} = \frac{E_{\pi_0}[\,p_*(X)/p_0(X)\,]}{E_{\pi_1}[\,p_*(X)/p_1(X)\,]} \approx \frac{\frac{1}{N_0}\sum_{i=1}^{N_0} p_*(x_{0,i})/p_0(x_{0,i})}{\frac{1}{N_1}\sum_{i=1}^{N_1} p_*(x_{1,i})/p_1(x_{1,i})} = \hat r_{\mathrm{bridge}} \tag{4}$$

where $x_{0,1}, \ldots, x_{0,N_0}$ are drawn from $\pi_0$ and $x_{1,1}, \ldots, x_{1,N_1}$ are drawn from $\pi_1$. 
One simple choice for the bridge distribution is the 'geometric' bridge: 

$$p_*^{\mathrm{geo}}(x) = \sqrt{p_0(x)\,p_1(x)} \tag{5}$$

which is in a sense half-way between $\pi_0$ and $\pi_1$. As discussed by Bennett (1976) and by Meng and 
Wong (1996), the asymptotically optimal choice of bridge distribution is 

$$p_*^{\mathrm{opt}}(x) = \frac{p_0(x)\,p_1(x)}{r\,(N_0/N_1)\,p_0(x) + p_1(x)} \tag{6}$$

where $r = Z_1/Z_0$. Of course, we cannot use this bridge distribution in practice, since we do not know 
$r$. We can use a preliminary guess at $r$ to define an initial bridge distribution, however, which will give 
us a bridge sampling estimate for $Z_1/Z_0$. Using this estimate as the new value of $r$, we can refine our 
bridge distribution, iterating this process as many times as desired. The result of this iteration can 
also be viewed as a maximum likelihood estimate for $r$, as discussed by Shirts et al. (2003), who argue 
on this basis that it is asymptotically as good as any estimate for $r$. I have found that estimates with 
$r$ set iteratively are often better than those found with the true value of $r$ (which does not contradict 
optimality of the true value for a fixed choice of bridge distribution). 
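
The bridge sampling estimate, the geometric bridge, and the iterative use of the 'optimal' bridge described above can be sketched as follows. The function names, the starting guess `r_init`, the fixed number of iterations, and the Gaussian test densities (for which $Z_1/Z_0 = 3$) are all assumptions made for illustration; the optimal bridge is written as $p_0 p_1 / (r (N_0/N_1) p_0 + p_1)$.

```python
import math
import random

def bridge_estimate(p0, p1, pstar, sample0, sample1):
    """Ratio of two SIS estimates: (Z*/Z0 from the pi_0 sample) / (Z*/Z1 from the pi_1 sample)."""
    num = sum(pstar(x) / p0(x) for x in sample0) / len(sample0)
    den = sum(pstar(x) / p1(x) for x in sample1) / len(sample1)
    return num / den

def geometric_bridge(p0, p1):
    return lambda x: math.sqrt(p0(x) * p1(x))

def optimal_bridge(p0, p1, r, n0, n1):
    """'Optimal' bridge for a given guess r at Z1/Z0 and sample sizes n0, n1."""
    return lambda x: p0(x) * p1(x) / (r * (n0 / n1) * p0(x) + p1(x))

def iterated_bridge_estimate(p0, p1, sample0, sample1, r_init=1.0, iterations=20):
    """Repeatedly plug the current estimate of r back into the optimal bridge."""
    r = r_init
    for _ in range(iterations):
        pstar = optimal_bridge(p0, p1, r, len(sample0), len(sample1))
        r = bridge_estimate(p0, p1, pstar, sample0, sample1)
    return r

# Assumed test case: pi_0 = N(0, 1), pi_1 = N(2, 1), with p1 scaled so that Z1/Z0 = 3.
def p0(x):
    return math.exp(-0.5 * x * x)

def p1(x):
    return 3.0 * math.exp(-0.5 * (x - 2.0) ** 2)

rng = random.Random(2)
sample0 = [rng.gauss(0.0, 1.0) for _ in range(5000)]
sample1 = [rng.gauss(2.0, 1.0) for _ in range(5000)]
r_geo = bridge_estimate(p0, p1, geometric_bridge(p0, p1), sample0, sample1)
r_opt = iterated_bridge_estimate(p0, p1, sample0, sample1)
```

A fixed iteration count is used only to keep the sketch short; in practice one would more likely stop when successive values of `r` agree to some tolerance.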

If $\pi_0$ and $\pi_1$ do not overlap sufficiently, no bridge distribution will produce good estimates, and 
we will have to introduce intermediate distributions as in equation (3). Note, however, that the 
bridge sampling estimate with either of the above bridge distributions converges to the correct ratio 
asymptotically as long as there is some region that has non-zero probability under both $\pi_0$ and $\pi_1$, a 
much weaker requirement than that for simple importance sampling. 

This advantage of bridge sampling over SIS can be seen in a simple example involving distributions 
that are uniform over an interval of the reals. Let $p_0(x) = I_{(0,3)}(x)$ and $p_1(x) = I_{(2,4)}(x)$, so that 
$Z_0 = 3$ and $Z_1 = 2$. The simple importance sampling estimate of equation (2) does not work, as it 
converges to 1/3 rather than 2/3. However, using a bridge distribution with $p_*(x) = I_{(2,3)}(x)$, which is 
effectively what both $p_*^{\mathrm{opt}}$ and $p_*^{\mathrm{geo}}$ will be in this example, the bridge sampling estimate of equation (4) 
converges to the correct value, since the numerator converges to 1/3 and the denominator to 1/2. 
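
The uniform example can be checked numerically with a short simulation. This is a sketch under stated assumptions: the helper `indicator`, the sample size, and the seed are illustrative choices, and closed intervals are used in place of open ones (which changes nothing except on a set of probability zero).

```python
import random

def indicator(a, b):
    """Indicator function of the interval [a, b], used as an unnormalized uniform density."""
    return lambda x: 1.0 if a <= x <= b else 0.0

p0 = indicator(0.0, 3.0)      # Z0 = 3
p1 = indicator(2.0, 4.0)      # Z1 = 2
pstar = indicator(2.0, 3.0)   # bridge supported on the overlap of the two intervals

rng = random.Random(0)
n = 50000
sample0 = [rng.uniform(0.0, 3.0) for _ in range(n)]
sample1 = [rng.uniform(2.0, 4.0) for _ in range(n)]

# SIS uses only the sample from pi_0, so it never sees the part of pi_1 lying in (3, 4).
r_sis = sum(p1(x) / p0(x) for x in sample0) / n

# Bridge sampling: numerator tends to 1/3, denominator to 1/2, and their ratio to 2/3.
numerator = sum(pstar(x) / p0(x) for x in sample0) / n
denominator = sum(pstar(x) / p1(x) for x in sample1) / n
r_bridge = numerator / denominator
```

With these settings `r_sis` settles near 1/3 while `r_bridge` settles near the true value 2/3.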

Although both simple importance sampling and bridge sampling have been successfully used in many 
applications, they have some deficiencies. One issue is that although the SIS estimate of equation (2) 
is unbiased for $Z_1/Z_0$, the bridge sampling estimate of equation (4) is not, and the same would 
appear to be the case for an estimate using intermediate distributions (via equation (3)). This is of 
no direct importance, particularly since we are often more interested in $\log(Z_1/Z_0)$ than in $Z_1/Z_0$ 
itself. However, it does preclude averaging independent replications of the bridge sampling estimate 
to obtain a better estimate, since the bias would prevent convergence to the correct value as the 
number of replications increases. A more vexing difficulty is that, except sometimes for $\pi_0$, sampling 
from the distributions $\pi_\eta$ must usually be done by Markov chain methods, which approach the desired 
distribution only asymptotically. To speed convergence, the Markov chain for sampling $\pi_{\eta_j}$ is often 
started from the last state sampled for $\pi_{\eta_{j-1}}$, but it is unclear how many iterations should then be 
discarded before an adequate approximation to the correct distribution is reached. 

Surprisingly, these difficulties can be completely overcome when using simple importance sampling 
with a single point. As shown by Jarzynski (1997, 2001), and later independently by myself (Neal 2001), 
an estimate for $Z_1/Z_0$ using intermediate distributions as in equation (3) will be exactly unbiased if 
each of the ratios $Z_{\eta_{j+1}}/Z_{\eta_j}$ is estimated using the simple importance sampling estimate of equation (2) 
with $N = 1$, sampling each distribution with a Markov chain update starting with the point for the 
previous distribution. Averaging the estimates obtained from $M$ independent replications of this 
process (called 'runs') produces the following estimate: 

$$\hat r_{\mathrm{AIS}} = \frac{1}{M}\sum_{i=1}^{M}\prod_{j=0}^{n-1}\frac{p_{\eta_{j+1}}(x_j^{(i)})}{p_{\eta_j}(x_j^{(i)})} = \frac{1}{M}\sum_{i=1}^{M}\hat r_{\mathrm{AIS}}^{(i)} \tag{7}$$

Here, $x_0^{(i)}$ is drawn independently from $\pi_0$, and each $x_j^{(i)}$ for $j > 0$ is generated by applying 
a Markov chain transition that leaves $\pi_{\eta_j}$ invariant to $x_{j-1}^{(i)}$. This single Markov transition (which 
could, however, consist of several Metropolis or other updates if we so choose), will usually not be 
enough to reach equilibrium, but the estimate $\hat r_{\mathrm{AIS}}$ is nevertheless exactly unbiased, and will converge 
to the true value as $M$ increases, provided that no region having zero probability under $\pi_{\eta_j}$ has 
non-zero probability under $\pi_{\eta_{j+1}}$. This can be proved by showing how the estimate above can be seen as 
a simple importance sampling estimate on an extended state space that includes the values sampled 
for the intermediate distributions. 
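
A single AIS run, as described by equation (7), can be sketched as below. The geometric path `log_p`, the random-walk Metropolis update, its step size, and the Gaussian end points (for which $Z_1/Z_0 = 0.5$) are assumptions made for this illustration rather than choices taken from the paper.

```python
import math
import random

def log_p(eta, x):
    """Assumed path: log of p_0(x)^(1-eta) * p_1(x)^eta, with p_0 = exp(-x^2/2)
    and p_1 = exp(-(x-1)^2 / (2 * 0.5^2)), so that Z1/Z0 = 0.5."""
    return (1.0 - eta) * (-0.5 * x * x) + eta * (-0.5 * ((x - 1.0) / 0.5) ** 2)

def metropolis(eta, x, rng, step=1.0):
    """One random-walk Metropolis update that leaves pi_eta invariant."""
    y = x + rng.gauss(0.0, step)
    if math.log(1.0 - rng.random()) < log_p(eta, y) - log_p(eta, x):
        return y
    return x

def ais_run(etas, rng):
    """One run: the product over j of p_{eta_{j+1}}(x_j) / p_{eta_j}(x_j)."""
    x = rng.gauss(0.0, 1.0)                  # x_0 drawn independently from pi_0 = N(0, 1)
    log_w = 0.0
    for j in range(len(etas) - 1):
        if j > 0:
            x = metropolis(etas[j], x, rng)  # x_j from x_{j-1}, using a transition for pi_{eta_j}
        log_w += log_p(etas[j + 1], x) - log_p(etas[j], x)
    return math.exp(log_w)

rng = random.Random(4)
etas = [j / 20 for j in range(21)]           # eta_0 = 0, ..., eta_n = 1 with n = 20
runs = [ais_run(etas, rng) for _ in range(2000)]
r_ais = sum(runs) / len(runs)                # should be close to 0.5
```

The weights are accumulated on the log scale only for numerical convenience; the quantity averaged is still the product in equation (7).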

I call this method 'Annealed Importance Sampling' (AIS), since the sequence of distributions used 
often corresponds to an 'annealing' procedure, in which the temperature is gradually decreased. As I 
discuss in (Neal 2001), this allows the procedure to sample different isolated modes of the distribution 
on different runs, properly weighting the points obtained from each of these runs to produce the correct 
probability for each mode. AIS is related to an earlier method for moving between isolated modes 
that I call 'tempered transitions' (Neal 1996). In a recent paper (Neal 2004), I show how tempered 
transitions can be modified to produce a method for efficient Markov chain sampling when some of 
the state variables are 'fast' — ie, when it is possible to more quickly recompute the probability of a 
state when only these fast variables change than when the other 'slow' variables change as well. In this 
method, the fast variables are 'dragged' through intermediate distributions in order to produce more 
appropriate values to go with a proposed change to the slow variables. Deciding whether to accept 
the final proposal involves what is in effect an estimate of the ratio of normalizing constants for the 
conditional distributions of the fast variables. 

In this paper, I show how the ideas behind Annealed Importance Sampling and bridge sampling 
can be combined. I call the resulting method 'Linked Importance Sampling' (LIS), since the two 
samples needed for bridge sampling are linked by a single state that is used in both. Intermediate 
distributions can be used, with each distribution being linked by a single state to the next distribution. 
In contrast to bridge sampling, LIS estimates are unbiased, and as is the case for AIS, they remain 
exactly unbiased even when intermediate distributions are used, and when sampling is done using 
Markov chain transitions that have not converged to their equilibrium distributions. 

Crooks (2000) mentions a different way of combining AIS with bridge sampling — since AIS esti- 
mates are simple importance sampling estimates on an extended state space, we can combine 'forward' 
and 'reverse' estimates to produce a bridge sampling estimate that may be superior. I will call this 
method 'bridged AIS'. Similarly, such a top-level application of bridge sampling can be combined with 
the low-level application of bridge sampling in LIS, giving what I call 'bridged LIS'. 

Using tests on sequences of one-dimensional distributions, I demonstrate that for some problems 
LIS is much more efficient than AIS — a result that should be expected, since in extreme cases, such 
as for the uniform distributions discussed above, the simple importance sampling estimates underlying 
AIS do not converge to the correct answer even asymptotically, whereas bridge sampling estimates do. 
For some other problems, however, AIS and LIS perform about equally well. The bridged version of 
AIS sometimes performs much better than the unbridged version, but still performs less well than LIS 
and its bridged version on some problems. I also analyse the asymptotic properties of AIS and LIS 
for some types of distribution, providing additional insight into their behaviour. 

Variants of tempered transitions and of my method for dragging fast variables can be constructed 
that are analogous to LIS rather than to AIS. I discuss the 'linked' variant of tempered transitions 
briefly, and include a more detailed description of a linked version of dragging, which may sometimes be 
better than the version related to AIS. I conclude by discussing some possibilities for future research. 

2 The Linked Importance Sampling procedure 

Assume that we can evaluate the unnormalized probability or density functions $p_\eta(x)$, for any value 
of the parameter $\eta$, with the normalized form of such a distribution being denoted by $\pi_\eta$. The values 
$\eta = 0$ and $\eta = 1$ define the two distributions we are interested in, for which the normalizing constants 
are $Z_0$ and $Z_1$. A sequence of $n - 1$ intermediate values for $\eta$ define distributions that will assist in 
estimating the ratio of these normalizing constants, $r = Z_1/Z_0$. We denote the values of $\eta$ for the 
distributions used by $\eta_0, \ldots, \eta_n$, with $\eta_0 = 0$ and $\eta_n = 1$. Typically, $\eta_j < \eta_{j+1}$ for all $j$. 

For problems in statistical physics, $\eta$ might be proportional to the inverse temperature, $\beta$, of 
equation (1), or might map to a value for $\lambda$. For a Bayesian inference problem, $\eta$ might be a power 
that the likelihood is raised to, so that $\eta = 0$ causes the data to be ignored, and $\eta = 1$ gives full 
weight to the data; the ratio $Z_1/Z_0$ will then be the marginal likelihood. In both of these examples, 
progressing in small steps from $\eta = 0$ to $\eta = 1$ is not only useful in estimating $Z_1/Z_0$, but also often 
has an 'annealing' effect, which helps avoid being trapped in a local mode of the distribution. 

2.1 Details of the LIS procedure 

For each distribution, $\pi_\eta$, assume we have a pair of Markov chain transition probability (or density) 
functions, denoted by $T_\eta(x, x')$ and $\underline{T}_\eta(x, x')$, satisfying $\int T_\eta(x, x')\,dx' = 1$ and $\int \underline{T}_\eta(x, x')\,dx' = 1$, for 
which the following mutual reversibility relationship holds: 

$$\pi_\eta(x)\, T_\eta(x, x') = \pi_\eta(x')\, \underline{T}_\eta(x', x), \quad \text{for all } x \text{ and } x' \tag{8}$$

From this relationship, one can easily show that both $T_\eta$ and $\underline{T}_\eta$ leave $\pi_\eta$ invariant, ie, that 
$\int \pi_\eta(x)\, T_\eta(x, x')\,dx = \pi_\eta(x')$, and the same for $\underline{T}_\eta$. If $T_\eta$ is reversible (ie, satisfies 'detailed balance'), 
then $\underline{T}_\eta$ will be the same as $T_\eta$. Non-reversible transitions often arise when components of state are 
updated in some predetermined order, in which case the reverse transition simply updates components 
in the opposite order. As a special case, $T_\eta$ might draw the next state from $\pi_\eta$ independently of the 
current state. Such independent sampling may often be possible for $T_0$. 

These Markov chain transitions are used to obtain samples that are approximately drawn from each 
of the $n + 1$ distributions, $\pi_{\eta_0}, \ldots, \pi_{\eta_n}$. We assume that we can begin sampling from $\pi_0$ by drawing 
a single point independently from $\pi_0$. For $j > 0$, we begin sampling from $\pi_{\eta_j}$ by selecting a link 
state, $x_{j-1*j}$, from the sample associated with $\pi_{\eta_{j-1}}$. For all $j$, we produce a sample of $K_j + 1$ states 
from this starting point by applying a total of $K_j$ forward ($T_{\eta_j}$) or reversed ($\underline{T}_{\eta_j}$) Markov transitions. 
Link states are selected using bridge distributions, $p_{j*j+1}$, which are defined in terms of $p_{\eta_j}$ and $p_{\eta_{j+1}}$, 
perhaps using the form of equation (5) or (6), with $p_0$ replaced by $p_{\eta_j}$ and $p_1$ by $p_{\eta_{j+1}}$. 

In detail, the Linked Importance Sampling procedure produces $M$ estimates, $\hat r_{\mathrm{LIS}}^{(1)}, \ldots, \hat r_{\mathrm{LIS}}^{(M)}$, that 
are averaged to produce the final estimate, $\hat r_{\mathrm{LIS}}$. Each $\hat r_{\mathrm{LIS}}^{(i)}$ is obtained by performing the following: 

The LIS Procedure 

1) Pick an integer $\nu_0$ uniformly at random from $\{0, \ldots, K_0\}$, and then set $x_{0,\nu_0}$ to a value drawn 
from $\pi_{\eta_0}$. 

2) For $j = 0, \ldots, n$, sample $K_j + 1$ states drawn (at least approximately) from $\pi_{\eta_j}$ as follows: 

a) If $j > 0$: Pick an integer $\nu_j$ uniformly at random from $\{0, \ldots, K_j\}$, and then set $x_{j,\nu_j}$ to 
$x_{j-1*j}$. 

b) For $k = \nu_j + 1, \ldots, K_j$, draw $x_{j,k}$ according to the forward Markov chain transition 
probabilities $T_{\eta_j}(x_{j,k-1}, x_{j,k})$. (If $\nu_j = K_j$, do nothing in this step.) 

c) For $k = \nu_j - 1, \ldots, 0$, draw $x_{j,k}$ according to the reverse Markov chain transition 
probabilities $\underline{T}_{\eta_j}(x_{j,k+1}, x_{j,k})$. (If $\nu_j = 0$, do nothing in this step.) 

d) If $j < n$: Pick a value for $\mu_j$ from $\{0, \ldots, K_j\}$ according to the following probabilities: 

$$\Pi_0(\mu_j \mid x_j) = \frac{p_{j*j+1}(x_{j,\mu_j})}{p_{\eta_j}(x_{j,\mu_j})} \Bigg/ \sum_{k=0}^{K_j} \frac{p_{j*j+1}(x_{j,k})}{p_{\eta_j}(x_{j,k})} \tag{9}$$

and then set $x_{j*j+1}$ to $x_{j,\mu_j}$. 

3) Set $\mu_n$ to a value chosen uniformly at random from $\{0, \ldots, K_n\}$. (This selection has no effect on 
the estimate, but is used in the proof of correctness.) 

4) Compute the estimate from this run as follows: 

$$\hat r_{\mathrm{LIS}}^{(i)} = \prod_{j=0}^{n-1} \left[ \frac{1}{K_j+1} \sum_{k=0}^{K_j} \frac{p_{j*j+1}(x_{j,k})}{p_{\eta_j}(x_{j,k})} \Bigg/ \frac{1}{K_{j+1}+1} \sum_{k=0}^{K_{j+1}} \frac{p_{j*j+1}(x_{j+1,k})}{p_{\eta_{j+1}}(x_{j+1,k})} \right] \tag{10}$$

(Note that most of the factors of $1/(K_j+1)$ and $1/(K_{j+1}+1)$ cancel, giving a final result of 
$(K_n+1)/(K_0+1)$, but the redundant factors are retained above for clarity of meaning.) 
Figure 1: An illustration of Linked Importance Sampling. One intermediate distribution is used, with 
$\eta_1 = 1/2$. The distributions $\pi_0$, $\pi_{1/2}$, and $\pi_1$ are represented by ovals enclosing the regions of high 
probability under each distribution. Nine Markov chain transitions are performed at each stage. The 
two link states are shown as black dots. The initial and final states (indexed by $\nu_0$ and $\mu_2$) are shown 
as gray dots. Other states generated by the forward and reverse Markov chain transitions are shown 
as empty dots. For this run, $\nu_0 = 4$, $\mu_0 = 9$, $\nu_1 = 1$, $\mu_1 = 8$, $\nu_2 = 3$, and $\mu_2 = 7$. 

The result of performing steps (1) through (3) is illustrated in Figure 1. After $M$ runs of this procedure, 
the final estimate is computed as 

$$\hat r_{\mathrm{LIS}} = \frac{1}{M} \sum_{i=1}^{M} \hat r_{\mathrm{LIS}}^{(i)} \tag{11}$$

The crucial aspect of Linked Importance Sampling is that when moving from distribution $\pi_{\eta_j}$ to 
$\pi_{\eta_{j+1}}$, a link state, $x_{j*j+1}$, is randomly selected from among the sample of points $x_{j,0}, \ldots, x_{j,K_j}$ that 
are associated with $\pi_{\eta_j}$. We can view the link state as part of the sample associated with $\pi_{\eta_{j+1}}$ as well 
as that associated with $\pi_{\eta_j}$. Accordingly, when using the 'optimal' bridge of equation (6), I will set 
$N_0/N_1$ to $(K_j+1)/(K_{j+1}+1)$, though the proof of optimality for bridge sampling does not guarantee 
that this is an optimal choice when using this bridge distribution for LIS. 
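
The LIS procedure and the estimate of equation (10) can be sketched as follows. This is an illustration under stated assumptions: the same number of transitions `K` is used for every distribution, the low-level bridges are geometric, the transitions are random-walk Metropolis updates (which are reversible, so one function serves as both $T_\eta$ and its reverse), and the path `log_p` joins two Gaussians for which $Z_1/Z_0 = 0.5$. None of these choices is taken from the paper's experiments. Step (3) is omitted because it has no effect on the estimate.

```python
import math
import random

def log_p(eta, x):
    """Assumed path from p_0 = exp(-x^2/2) to p_1 = exp(-(x-1)^2 / (2 * 0.5^2))."""
    return (1.0 - eta) * (-0.5 * x * x) + eta * (-0.5 * ((x - 1.0) / 0.5) ** 2)

def bridge_ratio(eta_a, eta_b, eta, x):
    """p_{a*b}(x) / p_eta(x) for the geometric bridge p_{a*b} = sqrt(p_a * p_b)."""
    return math.exp(0.5 * (log_p(eta_a, x) + log_p(eta_b, x)) - log_p(eta, x))

def metropolis(eta, x, rng, step=1.0):
    """Reversible random-walk Metropolis update leaving pi_eta invariant."""
    y = x + rng.gauss(0.0, step)
    if math.log(1.0 - rng.random()) < log_p(eta, y) - log_p(eta, x):
        return y
    return x

def lis_run(etas, K, rng):
    n = len(etas) - 1
    link = rng.gauss(0.0, 1.0)                  # Step 1: initial state drawn from pi_0
    estimate = 1.0
    numerator = None
    for j in range(n + 1):
        nu = rng.randrange(K + 1)               # Steps 1 and 2a: where the start/link state sits
        xs = [0.0] * (K + 1)
        xs[nu] = link
        for k in range(nu + 1, K + 1):          # Step 2b: forward transitions
            xs[k] = metropolis(etas[j], xs[k - 1], rng)
        for k in range(nu - 1, -1, -1):         # Step 2c: reverse transitions
            xs[k] = metropolis(etas[j], xs[k + 1], rng)
        if j > 0:                               # denominator of factor j-1 in equation (10)
            denominator = sum(bridge_ratio(etas[j - 1], etas[j], etas[j], x) for x in xs) / (K + 1)
            estimate *= numerator / denominator
        if j < n:                               # Step 2d: choose the next link state
            w = [bridge_ratio(etas[j], etas[j + 1], etas[j], x) for x in xs]
            numerator = sum(w) / (K + 1)        # numerator of factor j in equation (10)
            link = xs[rng.choices(range(K + 1), weights=w)[0]]
    return estimate

rng = random.Random(3)
etas = [j / 5 for j in range(6)]                # n = 5, so four intermediate distributions
runs = [lis_run(etas, 20, rng) for _ in range(500)]
r_lis = sum(runs) / len(runs)                   # equation (11); should be close to 0.5
```

Note that each link state is counted in two samples: it contributes to the numerator sum for distribution $j$ and to the denominator sum for distribution $j+1$.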

2.2 Proof that LIS estimates are unbiased 

In order to prove that $\hat r_{\mathrm{LIS}}$ is an unbiased estimate of $r = Z_1/Z_0$, we can regard steps (1) through (3) 
above as defining a distribution, $\Pi_0$, over all the quantities involved in the procedure, namely $x_j$, 
$\mu_j$, and $\nu_j$, for $j = 0, \ldots, n$, with $x_j$ representing $x_{j,0}, \ldots, x_{j,K_j}$. We then consider the procedure for 
generating these same quantities in reverse, which operates as follows: 
The Reverse LIS Procedure 

1) Pick an integer $\mu_n$ uniformly at random from $\{0, \ldots, K_n\}$, and then set $x_{n,\mu_n}$ to a value drawn 
from $\pi_{\eta_n}$. 

2) For $j = n, \ldots, 0$, sample $K_j + 1$ states drawn (at least approximately) from $\pi_{\eta_j}$ as follows: 

a) If $j < n$: Pick an integer $\mu_j$ uniformly at random from $\{0, \ldots, K_j\}$, and then set $x_{j,\mu_j}$ to 
$x_{j*j+1}$. 

b) For $k = \mu_j + 1, \ldots, K_j$, draw $x_{j,k}$ according to the forward Markov chain transition 
probabilities $T_{\eta_j}(x_{j,k-1}, x_{j,k})$. (If $\mu_j = K_j$, do nothing in this step.) 

c) For $k = \mu_j - 1, \ldots, 0$, draw $x_{j,k}$ according to the reverse Markov chain transition 
probabilities $\underline{T}_{\eta_j}(x_{j,k+1}, x_{j,k})$. (If $\mu_j = 0$, do nothing in this step.) 

d) If $j > 0$: Pick a value for $\nu_j$ from $\{0, \ldots, K_j\}$ according to the following probabilities: 

$$\Pi_1(\nu_j \mid x_j) = \frac{p_{j-1*j}(x_{j,\nu_j})}{p_{\eta_j}(x_{j,\nu_j})} \Bigg/ \sum_{k=0}^{K_j} \frac{p_{j-1*j}(x_{j,k})}{p_{\eta_j}(x_{j,k})} \tag{12}$$

and then set $x_{j-1*j}$ to $x_{j,\nu_j}$. 

3) Set $\nu_0$ to a value chosen uniformly at random from $\{0, \ldots, K_0\}$. 

This reverse procedure also defines a distribution over all the quantities generated ($x_j$, $\mu_j$, and $\nu_j$ for 
$j = 0, \ldots, n$), which will be denoted by $\Pi_1$. 

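We now define the unnormalized probability (density) functions $P_0(x, \mu, \nu) = Z_0\, \Pi_0(x, \mu, \nu)$ and 
$P_1(x, \mu, \nu) = Z_1\, \Pi_1(x, \mu, \nu)$. The ratio of normalizing constants for these distributions is obviously 
$r = Z_1/Z_0$. We can estimate this ratio by simple importance sampling, using the ratios 

$$\frac{P_1(x, \mu, \nu)}{P_0(x, \mu, \nu)} = \frac{Z_1\, \Pi_1(\mu_n)\, \pi_{\eta_n}(x_{n,\mu_n}) \prod_{j=0}^{n-1} \Pi_1(\mu_j) \prod_{j=0}^{n} \Pi_1(x_j \mid \mu_j, x_{j,\mu_j}) \prod_{j=1}^{n} \Pi_1(\nu_j \mid x_j)\; \Pi_1(\nu_0)}{Z_0\, \Pi_0(\nu_0)\, \pi_{\eta_0}(x_{0,\nu_0}) \prod_{j=1}^{n} \Pi_0(\nu_j) \prod_{j=0}^{n} \Pi_0(x_j \mid \nu_j, x_{j,\nu_j}) \prod_{j=0}^{n-1} \Pi_0(\mu_j \mid x_j)\; \Pi_0(\mu_n)} \tag{13}$$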

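From Steps (2b) and (2c) of the forward and reverse procedures, along with the mutual reversibility 
relationship of equation (8), we see that 

\begin{align}
\Pi_0(x_j \mid \nu_j, x_{j,\nu_j}) &= \prod_{k=\nu_j+1}^{K_j} T_{\eta_j}(x_{j,k-1}, x_{j,k}) \cdot \prod_{k=0}^{\nu_j-1} \underline{T}_{\eta_j}(x_{j,k+1}, x_{j,k}) \tag{14} \\
&= \prod_{k=\nu_j+1}^{K_j} T_{\eta_j}(x_{j,k-1}, x_{j,k}) \cdot \prod_{k=0}^{\nu_j-1} T_{\eta_j}(x_{j,k}, x_{j,k+1})\, \frac{\pi_{\eta_j}(x_{j,k})}{\pi_{\eta_j}(x_{j,k+1})} \tag{15}
\end{align}

and similarly, 

$$\Pi_1(x_j \mid \mu_j, x_{j,\mu_j}) = \prod_{k=\mu_j+1}^{K_j} T_{\eta_j}(x_{j,k-1}, x_{j,k}) \cdot \prod_{k=0}^{\mu_j-1} T_{\eta_j}(x_{j,k}, x_{j,k+1})\, \frac{\pi_{\eta_j}(x_{j,k})}{\pi_{\eta_j}(x_{j,k+1})} \tag{16}$$

In each case the ratios of $\pi_{\eta_j}$ telescope, so that the two expressions equal $\pi_{\eta_j}(x_{j,0}) / \pi_{\eta_j}(x_{j,\nu_j})$ and 
$\pi_{\eta_j}(x_{j,0}) / \pi_{\eta_j}(x_{j,\mu_j})$, respectively, times the common factor $\prod_{k=1}^{K_j} T_{\eta_j}(x_{j,k-1}, x_{j,k})$. 
From this, we see that parts of the ratio in equation (13) can be written as 

\begin{align}
\frac{Z_1\, \pi_{\eta_n}(x_{n,\mu_n}) \prod_{j=0}^{n} \Pi_1(x_j \mid \mu_j, x_{j,\mu_j})}{Z_0\, \pi_{\eta_0}(x_{0,\nu_0}) \prod_{j=0}^{n} \Pi_0(x_j \mid \nu_j, x_{j,\nu_j})}
&= \frac{Z_1\, \pi_{\eta_n}(x_{n,\mu_n})}{Z_0\, \pi_{\eta_0}(x_{0,\nu_0})} \prod_{j=0}^{n} \frac{\pi_{\eta_j}(x_{j,\nu_j})}{\pi_{\eta_j}(x_{j,\mu_j})} \tag{17} \\
&= \prod_{j=0}^{n-1} \frac{p_{\eta_{j+1}}(x_{j*j+1})}{p_{\eta_j}(x_{j*j+1})} \tag{18}
\end{align}

The last step uses the fact that for $j = 1, \ldots, n$, $x_{j,\nu_j} = x_{j-1*j} = x_{j-1,\mu_{j-1}}$, together with 
$\pi_\eta = p_\eta / Z_\eta$, so that the normalizing constants of the intermediate distributions cancel and the 
remaining factor $Z_{\eta_0}/Z_{\eta_n}$ cancels against $Z_1/Z_0$. 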

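From Steps (1) and (2a), we see that $\Pi_0(\nu_j) = 1/(K_j+1)$ and $\Pi_1(\mu_j) = 1/(K_j+1)$. Using this, 
and again using $x_{j,\nu_j} = x_{j-1*j} = x_{j-1,\mu_{j-1}}$, we get that 

\begin{align}
\frac{\prod_{j=0}^{n-1} \Pi_1(\mu_j)\ \prod_{j=1}^{n} \Pi_1(\nu_j \mid x_j)}{\prod_{j=1}^{n} \Pi_0(\nu_j)\ \prod_{j=0}^{n-1} \Pi_0(\mu_j \mid x_j)}
&= \frac{\prod_{j=0}^{n-1} \Pi_1(\nu_{j+1} \mid x_{j+1})\,(K_{j+1}+1)}{\prod_{j=0}^{n-1} \Pi_0(\mu_j \mid x_j)\,(K_j+1)} \tag{19} \\
&= \prod_{j=0}^{n-1} \left[ \frac{p_{j*j+1}(x_{j*j+1})}{p_{\eta_{j+1}}(x_{j*j+1})} \Bigg/ \frac{1}{K_{j+1}+1} \sum_{k=0}^{K_{j+1}} \frac{p_{j*j+1}(x_{j+1,k})}{p_{\eta_{j+1}}(x_{j+1,k})} \right] \left[ \frac{p_{j*j+1}(x_{j*j+1})}{p_{\eta_j}(x_{j*j+1})} \Bigg/ \frac{1}{K_j+1} \sum_{k=0}^{K_j} \frac{p_{j*j+1}(x_{j,k})}{p_{\eta_j}(x_{j,k})} \right]^{-1} \tag{20} \\
&= \prod_{j=0}^{n-1} \frac{p_{\eta_j}(x_{j*j+1})}{p_{\eta_{j+1}}(x_{j*j+1})} \cdot \prod_{j=0}^{n-1} \left[ \frac{1}{K_j+1} \sum_{k=0}^{K_j} \frac{p_{j*j+1}(x_{j,k})}{p_{\eta_j}(x_{j,k})} \Bigg/ \frac{1}{K_{j+1}+1} \sum_{k=0}^{K_{j+1}} \frac{p_{j*j+1}(x_{j+1,k})}{p_{\eta_{j+1}}(x_{j+1,k})} \right] \tag{21}
\end{align}

From Steps (1) and (3), we see that $\Pi_0(\nu_0) = \Pi_1(\nu_0) = 1/(K_0+1)$ and $\Pi_1(\mu_n) = \Pi_0(\mu_n) = 
1/(K_n+1)$, so these factors cancel in equation (13). The factors in equation (18) cancel with the first 
part of equation (21). The final result is that the simple importance sampling estimate based on a 
single LIS run is as shown in equation (10), demonstrating that $\hat r_{\mathrm{LIS}}$ is indeed an unbiased estimate of 
$r = Z_1/Z_0$. 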



2.3 Bridged LIS estimates 

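Since the LIS estimate can be viewed as a simple importance sampling estimate on an extended space, 
we can consider a 'bridged LIS' estimate in which this top-level SIS estimate is replaced by a bridge 
sampling estimate. This will require that we actually perform the reverse LIS procedure described 
above, from which an LIS estimate for the reverse ratio, $\underline{r} = Z_0/Z_1$, can be computed: 

$$\hat{\underline{r}}_{\mathrm{LIS}}^{(i)} = \prod_{j=1}^{n} \left[ \frac{1}{K_j+1} \sum_{k=0}^{K_j} \frac{p_{j-1*j}(x_{j,k})}{p_{\eta_j}(x_{j,k})} \Bigg/ \frac{1}{K_{j-1}+1} \sum_{k=0}^{K_{j-1}} \frac{p_{j-1*j}(x_{j-1,k})}{p_{\eta_{j-1}}(x_{j-1,k})} \right] \tag{22}$$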
The reversed procedure requires independent sampling from $\pi_1$. This will usually not be possible 
directly, but well-separated states from a Markov chain sampler with $\pi_1$ as its invariant distribution 
will provide a good approximation, provided that this sampler moves around the whole distribution, 
without being trapped in an isolated mode. Indeed, the entire sample of $K_n + 1$ states from $\pi_1$ that is 
needed at the start of the reverse procedure can be obtained by taking consecutive states from such a 
Markov chain sampler. 

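For the bridged form of LIS, we also need a suitable bridge distribution, $P_*$, for which we must be 
able to evaluate the ratios $P_*/P_0$ and $P_*/P_1$. (Note that this choice of a 'top-level' bridge distribution 
is separate from the choices of 'low-level' bridge distributions, though we might use the same 
form for both.) With the optimal bridge of equation (6), these ratios can be written as follows, if the 
forward procedure is performed $M$ times and the reverse procedure $\underline{M}$ times: 

$$\frac{P_*^{\mathrm{opt}}(x, \mu, \nu)}{P_0(x, \mu, \nu)} = \left[ r\,(M/\underline{M})\, \frac{P_0(x, \mu, \nu)}{P_1(x, \mu, \nu)} + 1 \right]^{-1} \tag{23}$$

$$\frac{P_*^{\mathrm{opt}}(x, \mu, \nu)}{P_1(x, \mu, \nu)} = \left[ r\,(M/\underline{M}) + \frac{P_1(x, \mu, \nu)}{P_0(x, \mu, \nu)} \right]^{-1} \tag{24}$$

The geometric bridge of equation (5) results in 

$$\frac{P_*^{\mathrm{geo}}(x, \mu, \nu)}{P_0(x, \mu, \nu)} = \sqrt{\frac{P_1(x, \mu, \nu)}{P_0(x, \mu, \nu)}} \tag{25}$$

$$\frac{P_*^{\mathrm{geo}}(x, \mu, \nu)}{P_1(x, \mu, \nu)} = \sqrt{\frac{P_0(x, \mu, \nu)}{P_1(x, \mu, \nu)}} \tag{26}$$

These expressions allow us to express bridged LIS estimates in terms of the simple LIS estimate of 
equation (10) and its reverse version of equation (22). For the optimal bridge, we get 

$$\hat r_{\mathrm{LIS\text{-}bridged}}^{\mathrm{opt}} = \frac{1}{M} \sum_{i=1}^{M} \frac{1}{r\,(M/\underline{M}) / \hat r_{\mathrm{LIS}}^{(i)} + 1} \Bigg/ \frac{1}{\underline{M}} \sum_{i=1}^{\underline{M}} \frac{1}{r\,(M/\underline{M}) + 1 / \hat{\underline{r}}_{\mathrm{LIS}}^{(i)}} \tag{27}$$

Similarly, for the geometric bridge, we get 

$$\hat r_{\mathrm{LIS\text{-}bridged}}^{\mathrm{geo}} = \frac{1}{M} \sum_{i=1}^{M} \sqrt{\hat r_{\mathrm{LIS}}^{(i)}} \Bigg/ \frac{1}{\underline{M}} \sum_{i=1}^{\underline{M}} \sqrt{\hat{\underline{r}}_{\mathrm{LIS}}^{(i)}} \tag{28}$$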
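
Given a list of forward LIS estimates of $Z_1/Z_0$ and a list of reverse LIS estimates of $Z_0/Z_1$, the two bridged estimates can be computed as in the sketch below. The function names are assumptions, as is the choice to set the unknown $r$ in the optimal bridge by fixed-point iteration from `r_init` (following the iterative scheme described for ordinary bridge sampling). The summands are written in the algebraically equivalent forms $\hat r / (r s + \hat r)$ and $\hat{\underline{r}} / (r s\, \hat{\underline{r}} + 1)$, with $s$ the ratio of the numbers of forward and reverse runs, so that an estimate equal to zero does not cause a division by zero.

```python
import math

def bridged_geometric(forward, reverse):
    """Geometric-bridge combination of forward estimates of Z1/Z0 and reverse estimates of Z0/Z1."""
    num = sum(math.sqrt(rf) for rf in forward) / len(forward)
    den = sum(math.sqrt(rr) for rr in reverse) / len(reverse)
    return num / den

def bridged_optimal(forward, reverse, r_init=1.0, iterations=50):
    """Optimal-bridge combination, with r set by fixed-point iteration (an assumed scheme)."""
    m, m_rev = len(forward), len(reverse)
    s = m / m_rev                      # ratio of forward to reverse run counts
    r = r_init
    for _ in range(iterations):
        num = sum(rf / (r * s + rf) for rf in forward) / m              # = 1 / (r s / rf + 1)
        den = sum(rr / (r * s * rr + 1.0) for rr in reverse) / m_rev    # = 1 / (r s + 1 / rr)
        r = num / den
    return r

# Consistency check with idealized inputs: every forward run returns 2 and every reverse run 1/2.
forward = [2.0] * 6
reverse = [0.5] * 3
r_geo = bridged_geometric(forward, reverse)
r_opt = bridged_optimal(forward, reverse)
```

With these idealized inputs both combinations return 2, as they should when the forward and reverse estimates agree exactly.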



2.4 LIS estimates with independent sampling with no intermediate distributions 

It is interesting to look at the special case of Linked Importance Sampling with $n = 1$ (ie, in which 
there are no intermediate distributions between $\pi_0$ and $\pi_1$) in which the points from both $\pi_0$ and 
$\pi_1$ are sampled independently. The LIS procedure can then be simplified somewhat, and it is also 
possible to improve the LIS estimate by averaging over the choice of link state. Such averaging is not 
feasible when Markov chain sampling is used, since choosing a different link state would require a new 
simulation of the Markov transitions. 



Since we will sample points independently, there is no need to decide how many points will be 
sampled by the forward transitions and how many by the reverse transitions in Steps (2a) and (2b) 
of the LIS procedure. We simply obtain a pair of samples consisting of points $x_{0,0}, \ldots, x_{0,K_0}$ drawn 
independently from $\pi_0$, and points $x_{1,1}, \ldots, x_{1,K_1}$ drawn independently from $\pi_1$. We then randomly 
select a link state, indexed by $\mu$, from among $x_{0,0}, \ldots, x_{0,K_0}$ according to the following probabilities, 
which depend on the choice of a single bridge distribution, denoted by $p_*(x)$: 

$$\Pi_0(\mu \mid x_0) = \frac{p_*(x_{0,\mu})}{p_0(x_{0,\mu})} \Bigg/ \sum_{k=0}^{K_0} \frac{p_*(x_{0,k})}{p_0(x_{0,k})} \tag{29}$$

The LIS estimate for $r = Z_1/Z_0$ based on this pair of samples from $\pi_0$ and $\pi_1$ is 

$$\hat r_{\mathrm{LIS}}^{(i)} = \frac{1}{K_0+1} \sum_{k=0}^{K_0} \frac{p_*(x_{0,k})}{p_0(x_{0,k})} \Bigg/ \frac{1}{K_1+1} \left[ \frac{p_*(x_{0,\mu})}{p_1(x_{0,\mu})} + \sum_{k=1}^{K_1} \frac{p_*(x_{1,k})}{p_1(x_{1,k})} \right] \tag{30}$$



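The superscript $i$ is used here to indicate that this estimate is based on the $i$'th pair of samples. We 
can see that it is very similar to the bridge sampling estimate of equation (4), except that the link 
state is included in both samples. Since these LIS estimates are unbiased, we can average $M$ of them 
to obtain a final LIS estimate. 

We can also average the estimate of equation (30) over the random choice of link state, which is 
guaranteed to produce an estimate (also unbiased) with smaller mean-squared-error (see Schervish 
1995, Section 3.2). The result is 

\begin{align}
\hat r_{\mathrm{LIS\text{-}ave}}^{(i)} &= \sum_{\mu=0}^{K_0} \Pi_0(\mu \mid x_0)\; \frac{1}{K_0+1} \sum_{k=0}^{K_0} \frac{p_*(x_{0,k})}{p_0(x_{0,k})} \Bigg/ \frac{1}{K_1+1} \left[ \frac{p_*(x_{0,\mu})}{p_1(x_{0,\mu})} + \sum_{k=1}^{K_1} \frac{p_*(x_{1,k})}{p_1(x_{1,k})} \right] \tag{31} \\
&= \frac{1}{K_0+1} \sum_{\mu=0}^{K_0} \left( \frac{p_*(x_{0,\mu})}{p_0(x_{0,\mu})} \Bigg/ \frac{1}{K_1+1} \left[ \frac{p_*(x_{0,\mu})}{p_1(x_{0,\mu})} + \sum_{k=1}^{K_1} \frac{p_*(x_{1,k})}{p_1(x_{1,k})} \right] \right) \tag{32}
\end{align}

Averaging these estimates over $M$ pairs of samples produces a final estimate denoted by $\hat r_{\mathrm{LIS\text{-}ave}}$. 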
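
For independent sampling with no intermediate distributions, the single-link estimate of equation (30) and the link-averaged estimate of equation (32) can both be computed from one pair of samples, as in this sketch. The function name `lis_pair_estimates`, the geometric bridge, the sample sizes, and the Gaussian test densities (for which $Z_1/Z_0 = 3$) are assumptions made for illustration.

```python
import math
import random

def lis_pair_estimates(p0, p1, pstar, x0, x1, rng):
    """x0 holds K0+1 points from pi_0; x1 holds K1 points from pi_1 (the link state is the extra point).
    Returns the estimate with one randomly chosen link state and the link-averaged estimate."""
    a = [pstar(x) / p0(x) for x in x0]           # p*(x_{0,k}) / p0(x_{0,k})
    b_sum = sum(pstar(x) / p1(x) for x in x1)    # sum over the K1 points drawn from pi_1
    n0 = len(x0)                                 # K0 + 1
    n1 = len(x1) + 1                             # K1 + 1

    def denominator(link):
        return (pstar(link) / p1(link) + b_sum) / n1

    mu = rng.choices(range(n0), weights=a)[0]    # link index chosen with probability proportional to a
    r_single = (sum(a) / n0) / denominator(x0[mu])
    r_ave = sum(a[m] / denominator(x0[m]) for m in range(n0)) / n0
    return r_single, r_ave

# Assumed test case: pi_0 = N(0, 1), pi_1 = N(2, 1), with p1 scaled so that Z1/Z0 = 3.
def p0(x):
    return math.exp(-0.5 * x * x)

def p1(x):
    return 3.0 * math.exp(-0.5 * (x - 2.0) ** 2)

def pstar(x):
    return math.sqrt(p0(x) * p1(x))              # geometric bridge

rng = random.Random(6)
pairs = []
for _ in range(200):                             # M = 200 pairs of samples, K0 = K1 = 50
    x0 = [rng.gauss(0.0, 1.0) for _ in range(51)]
    x1 = [rng.gauss(2.0, 1.0) for _ in range(50)]
    pairs.append(lis_pair_estimates(p0, p1, pstar, x0, x1, rng))
r_lis = sum(s for s, _ in pairs) / len(pairs)
r_lis_ave = sum(v for _, v in pairs) / len(pairs)
```

Both averages estimate the same ratio; the second removes the randomness due to the choice of link state within each pair.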

To use bridged LIS in this context, we need to find reverse estimates as well, but these reverse 
estimates needn't be independent of the forward estimates, since the asymptotic validity of the bridge 
sampling estimate of equation (4) does not depend on the samples $x_0$ and $x_1$ being independent. 
Accordingly, we can use the same samples from $\pi_0$ and $\pi_1$ for the forward and the reverse operations. 
However, to perform reverse sampling, we need to have a sample of $K_1 + 1$ points drawn from $\pi_1$, the 
first of which is ignored when performing forward sampling. Conversely, the first of the $K_0 + 1$ points 
drawn from $\pi_0$ is ignored when performing the reverse sampling. 

We can improve the bridged LIS estimates by averaging the numerator and the denominator of 
equation (27) or (28) with respect to the random choice of link state. We can also average with 
respect to the omission of one of the points from one of the samples; ie, rather than omitting 
the first of $K_1 + 1$ points in the sample from $\pi_1$ when computing a forward estimate, we average 
with respect to a random choice of point to omit, and similarly for reverse estimates. Note that the 
averaging should be done over the sums in the numerator and denominator, not with respect to the 
entire estimate, nor with respect to the values of $\hat r_{\mathrm{LIS}}$ and $\hat{\underline{r}}_{\mathrm{LIS}}$ appearing inside the summands. The 
effective sample size after this additional averaging of dependent points is unclear, so it is not obvious 
what the ratio of sample sizes in equation (6) should be, but using $(K_0+1)/(K_1+1)$ is probably 
adequate. 

3 Analytical comparisons of AIS and LIS 

In this section, I analyse (somewhat informally) the performance of AIS and LIS asymptotically, and 
in other situations where analytical results are possible. 

3.1 Asymptotic properties of AIS and LIS estimates 

I begin by analysing the asymptotic performance of AIS and LIS when the sequence of distributions 
is defined by an unnormalized density function of the following form:

$$
p_\eta(x) = p_0(x) \exp(-\eta\, U(x))
\tag{33}
$$

This class includes sequences of canonical distributions in which the inverse
temperature varies, as well as sequences that can be used for Bayesian analysis, in which $p_0$ defines the
prior and $\eta$ is a power that the likelihood (expressed as $\exp(-U(x))$) is raised to, with $\eta = 1$ giving the
posterior distribution. For these distributions, we can express $r$ using the well-known 'thermodynamic
integration' formula as follows:

$$
\log(r) = -\int_0^1 E_{\pi_\eta}[U] \, d\eta
\tag{34}
$$

The analysis here is asymptotic, as the number of intermediate distributions used, given by $n - 1$,
goes to infinity. I will assume the $\eta_j$ defining these distributions are chosen according to a scheme in
which for any $a \in (0, 1)$, the spacing $\eta_{j+1} - \eta_j$ when $j = \lfloor an \rfloor$ is asymptotically proportional to $1/n$;
in other words, the relative density of intermediate distributions in the neighborhood of different
values of $\eta$ stays the same as the overall density increases. The simplest such scheme is to let $\eta_j = j/n$,
though other schemes may sometimes be better.

With the above form for $p_\eta$, the AIS estimate from a single run can be written
as follows:

$$
\hat r^{(i)}_{\text{AIS}}
= \prod_{j=0}^{n-1} \frac{p_{\eta_{j+1}}(x^{(i)}_j)}{p_{\eta_j}(x^{(i)}_j)}
= \exp\Big( -\sum_{j=0}^{n-1} (\eta_{j+1} - \eta_j)\, U(x^{(i)}_j) \Big)
\tag{35}
$$

When $\eta_j = j/n$, this can be seen as a stochastic form of Riemann's Rule for numerically integrating
equation (34), though one difference is that $\log \hat r_{\text{AIS}}$ converges to the correct value as $M$ goes to infinity
even if $n$ stays fixed.

Provided that there is some finite bound on the variance of $U$ under all the distributions $\pi_{\eta_j}$, and
that the Markov transitions used mix well, a Central Limit Theorem will apply, allowing us to conclude
that the distribution of $\ell_n = \log \hat r^{(i)}_{\text{AIS}}$ becomes Gaussian as $n$ goes to infinity. Let the mean of $\ell_n$ be
$\mu_n$, and let the variance of $\ell_n$ asymptotically be $\sigma^2/n$, where $\sigma$ is determined by details of the spacing
of intermediate distributions and of the degree of autocorrelation in the Markov transitions. Note
that $E[Y^q] = \exp(q\mu + q^2\sigma^2/2)$ when $Y = \exp(X)$ and $X$ is Gaussian with mean $\mu$ and variance $\sigma^2$.

Using this, the mean of $\exp(\ell_n)$ is $\exp(\mu_n + \sigma^2/2n)$. This must equal $r$, since $\hat r_{\text{AIS}}$ is unbiased, so
$\mu_n = \log(r) - \sigma^2/2n$. Applying the same moment formula with $q = 2$, we can see that the variance of
$\hat r^{(i)}_{\text{AIS}} = \exp(\ell_n)$ is $r^2 [\exp(\sigma^2/n) - 1]$, which for large $n$ will be approximately $r^2\sigma^2/n$.
The variance of $\hat r_{\text{AIS}}$ will therefore be $r^2\sigma^2/nM$.
Asymptotically, the total computational effort, which will generally be proportional to $nM$, can be
divided in any way between more intermediate distributions ($n$) or more runs ($M$) without affecting
the accuracy of estimation of $r$, provided that $n$ is kept large enough that these asymptotic results
apply, a fact noted by Hendrix and Jarzynski (2001). We can therefore use a value of $M$ greater
than one without penalty, in order to obtain an error estimate from the degree of variation over the
$M$ runs.
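
As a rough numerical illustration of this trade-off between $n$ and $M$, the following Python sketch computes the variance of an average of $M$ single-run estimates from the Gaussian moment formula alone, under the assumption that the log of each single-run estimate is exactly Gaussian with variance $\sigma^2/n$. The function names and the values chosen for `r` and `sigma2` are illustrative assumptions, not taken from the programs used for the tests.

```python
import math

def lognormal_moment(q, mu, var):
    """E[Y^q] for Y = exp(X), where X is Gaussian with mean mu and variance var."""
    return math.exp(q * mu + q * q * var / 2.0)

def ais_average_variance(r, sigma2, n, M):
    """Variance of the average of M independent single-run estimates,
    assuming the log of each estimate is Gaussian with variance sigma2/n
    and a mean chosen so that each estimate is unbiased for r."""
    var_log = sigma2 / n
    mu = math.log(r) - var_log / 2.0            # gives E[exp(l_n)] = r
    mean = lognormal_moment(1, mu, var_log)
    second = lognormal_moment(2, mu, var_log)
    return (second - mean * mean) / M

if __name__ == "__main__":
    r, sigma2 = 0.05, 4.0                       # illustrative values only
    for n, M in [(100, 20), (200, 10), (400, 5)]:   # same total effort n*M
        print(n, M, ais_average_variance(r, sigma2, n, M))
```

For these three ways of splitting the same total effort, the computed variances differ by only a few percent, and the differences shrink further as $n$ grows.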

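For LIS, we can write the log of the estimate from one run as follows:

$$
\log \hat r^{(i)}_{\text{LIS}} = \sum_{j=0}^{n-1} \left[
\log\left( \frac{1}{K_j+1} \sum_{k=0}^{K_j} \frac{p_{j*j+1}(x_{j,k})}{p_{\eta_j}(x_{j,k})} \right)
- \log\left( \frac{1}{K_{j+1}+1} \sum_{k=0}^{K_{j+1}} \frac{p_{j*j+1}(x_{j+1,k})}{p_{\eta_{j+1}}(x_{j+1,k})} \right)
\right]
\tag{36}
$$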



Suppose that we let $K_j = \lceil m \kappa_j \rceil$ for all $j$ and some set of $\kappa_j$, and that we then let $m$ go to
infinity. Assuming that the variances of the ratios of probabilities are finite, and that the Markov
chain transitions used mix sufficiently well, a Central Limit Theorem will again apply, and we can
conclude that all of the $n$ terms in the sum above, and therefore also the sum itself, will approach
Gaussian distributions, with variances proportional to $1/m$.

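To analyse the LIS estimate in more detail, we need to assume a form of bridge distribution, as well
as a form for $p_\eta$. If $p_\eta$ has the form of equation (33) and we use the geometric bridge,
$p_{j*j+1}(x) = \sqrt{p_{\eta_j}(x)\, p_{\eta_{j+1}}(x)}$, we can write

$$
\log \hat r^{(i)}_{\text{LIS}} = \sum_{j=0}^{n-1} \left[
\log\left( \frac{1}{K_j+1} \sum_{k=0}^{K_j} \exp\big( -(\eta_{j+1}-\eta_j)\, U(x_{j,k}) / 2 \big) \right)
- \log\left( \frac{1}{K_{j+1}+1} \sum_{k=0}^{K_{j+1}} \exp\big( (\eta_{j+1}-\eta_j)\, U(x_{j+1,k}) / 2 \big) \right)
\right]
\tag{37}
$$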



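Since $\exp(z) \approx 1 + z$ and $\log(1 + z) \approx z$ when $z$ is small, we can rewrite this when $n$ is large (and
hence $\eta_{j+1} - \eta_j$ is small) as

$$
\log \hat r^{(i)}_{\text{LIS}} \approx \sum_{j=0}^{n-1} \left[
\log\left( 1 - \frac{\eta_{j+1}-\eta_j}{2}\, \frac{1}{K_j+1} \sum_{k=0}^{K_j} U(x_{j,k}) \right)
- \log\left( 1 + \frac{\eta_{j+1}-\eta_j}{2}\, \frac{1}{K_{j+1}+1} \sum_{k=0}^{K_{j+1}} U(x_{j+1,k}) \right)
\right]
\tag{38}
$$

$$
\approx -\sum_{j=0}^{n-1} \frac{\eta_{j+1}-\eta_j}{2} \left[
\frac{1}{K_j+1} \sum_{k=0}^{K_j} U(x_{j,k})
+ \frac{1}{K_{j+1}+1} \sum_{k=0}^{K_{j+1}} U(x_{j+1,k}) \right]
\tag{39}
$$

$$
= -\frac{\eta_1-\eta_0}{2}\, \frac{1}{K_0+1} \sum_{k=0}^{K_0} U(x_{0,k})
\;-\; \frac{\eta_n-\eta_{n-1}}{2}\, \frac{1}{K_n+1} \sum_{k=0}^{K_n} U(x_{n,k})
\;-\; \sum_{j=1}^{n-1} \frac{\eta_{j+1}-\eta_{j-1}}{2}\, \frac{1}{K_j+1} \sum_{k=0}^{K_j} U(x_{j,k})
\tag{40}
$$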

When $\eta_j = j/n$, this looks like a stochastic form of the Trapezoidal Rule for numerically integrating
equation (34). Since the Trapezoidal Rule converges faster than Riemann's Rule, one might expect
LIS to perform better than AIS asymptotically, but this is not so in this stochastic situation. Suppose
for simplicity that we set all $K_j = m$. The variance of $\log \hat r^{(i)}_{\text{LIS}}$ will be dominated by the variance of the
last sum above, which will be proportional to $1/nm$, assuming that $m$ is large, so that the dependence
between terms (from sharing link states) is negligible. Using the same argument as for AIS above,
the variance of $\log \hat r_{\text{LIS}}$ will be proportional to $1/nmM$. Considering that the computation time for an
LIS run will be proportional to $nm$, versus $n$ for AIS, we see that the variances of the AIS and LIS
estimates go down the same way in proportion to computation time, asymptotically as $n$ and $m$ go to
infinity.
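
The two single-run log estimates compared above can be sketched in Python as follows, for densities of the form $p_\eta(x) = p_0(x)\exp(-\eta U(x))$ with the geometric bridge. This is a minimal sketch: the function names and the layout of the inputs (lists of $U$ values, with link states assumed to be already included in each LIS sample) are assumptions made for illustration, and no sampling is performed.

```python
import math

def ais_log_estimate(etas, u_values):
    """Log of a single-run AIS estimate: -sum_j (eta_{j+1} - eta_j) * U(x_j).
    u_values[j] is U at the single point sampled for distribution j."""
    return -sum((etas[j + 1] - etas[j]) * u_values[j]
                for j in range(len(etas) - 1))

def lis_geometric_log_estimate(etas, u_samples):
    """Log of a single-run LIS estimate using the geometric bridge.
    u_samples[j] holds U for the K_j + 1 points of sample j, j = 0..n
    (each sample is assumed to include its link state already)."""
    total = 0.0
    for j in range(len(etas) - 1):
        half = (etas[j + 1] - etas[j]) / 2.0
        num = sum(math.exp(-half * u) for u in u_samples[j]) / len(u_samples[j])
        den = sum(math.exp(half * u) for u in u_samples[j + 1]) / len(u_samples[j + 1])
        total += math.log(num) - math.log(den)
    return total

if __name__ == "__main__":
    etas = [0.0, 0.25, 0.5, 0.75, 1.0]
    # With U constant, both estimates reduce to -U * (eta_n - eta_0).
    print(ais_log_estimate(etas, [2.0] * 4))
    print(lis_geometric_log_estimate(etas, [[2.0] * 3 for _ in etas]))
```

The AIS sum weights each $U$ value by a single spacing, as in Riemann's Rule, whereas each interior LIS sample enters through the two spacings on either side of it, which is what gives the trapezoidal form for small spacings.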

Furthermore, the proportionality constant should be the same for AIS and LIS, assuming that the
overhead of the two procedures is negligible compared to the time spent performing Markov transitions,
so that the proportionality constants for computation time are the same for AIS (multiplying $n$) and
for LIS (multiplying $nm$). The proportionality constants for variance for AIS (multiplying $1/nM$)
and for LIS (multiplying $1/nmM$) depend in a complex way on the form of the density of $\eta_j$ values
and on the mixing properties of the Markov transitions, but the result should be the same for AIS
and LIS, provided the same scheme is used for choosing $\eta_j$ values, and the same Markov transitions
are used, parameterized smoothly in terms of $\eta$. A difference that might appear significant is that
for AIS only one Markov transition is done for each $\eta_j$, whereas for LIS, $m$ such transitions are
done. However, as $n$ goes to infinity, nearby distributions become more similar, so transitions for $m$
consecutive distributions become similar to $m$ transitions for one of these distributions.

The apparently pessimistic conclusion from this is that when both $n$ and $m$ (and hence the $K_j$)
are large, the performance of LIS should be about the same as that of AIS (with $n$ for AIS chosen to
equalize the computation time), assuming that the distributions used have the form of equation (33),
that the variance of $U$ is finite under all of the distributions $\pi_\eta$, and that the Markov transitions used
mix well enough. Fortunately, however, there is no reason to make both $m$ and $n$ large with LIS. For
good performance, $n$ must be large enough that $\pi_{\eta_j}$ and $\pi_{\eta_{j+1}}$ overlap significantly, but there is no
reason to make $n$ much larger than this. The accuracy of the estimates can be improved as desired
by increasing $m$ and/or $M$ while keeping $n$ fixed. The results below show that LIS estimates with $n$
fixed are sometimes much better than AIS estimates.

Finally, let us consider the asymptotic performance of the bridged versions of AIS and LIS, assuming
that the variance of $U$ is finite, so that the distribution of the estimates from individual runs becomes
Gaussian as $n$ (for AIS) or $m$ (for LIS) goes to infinity. Looking at equations (27) and (28), which
also are applicable to bridged AIS estimates, we see that the log of $\hat r_{\text{LIS-bridged}}$ can for both optimal and
geometric bridges be expressed as the difference of the log of the numerator, which is the mean of a
function of the forward estimates, $\hat r^{(i)}_{\text{LIS}}$, and the log of the denominator, which is the mean of a function
of the reverse estimates, $\underline{\hat r}^{(i)}_{\text{LIS}}$. If these forward and reverse estimates have Gaussian distributions with
small variances, $\sigma^2$ and $\underline{\sigma}^2$, then $\hat r_{\text{LIS-bridged}}$ will also be Gaussian, with a variance that can be computed
in terms of the derivatives of the summands in the numerator and the denominator, with respect to
$\hat r^{(i)}_{\text{LIS}}$ and $\underline{\hat r}^{(i)}_{\text{LIS}}$, evaluated at the true values of $r$ and $1/r$. I will assume that $r = 1$ below, as can be
done without loss of generality.

For the geometric bridge, these derivatives are both 1/2, from which it follows that the variance of
the numerator of the bridged estimate is $\sigma^2/4M$ and that of the denominator is $\underline{\sigma}^2/4\underline{M}$. Since the numerator
and denominator evaluate to one for $\hat r^{(i)}_{\text{LIS}} = r = 1$ and $\underline{\hat r}^{(i)}_{\text{LIS}} = 1/r = 1$, the sum of the variances of
the logs of the numerator and denominator is $\sigma^2/4M + \underline{\sigma}^2/4\underline{M}$. If $\sigma^2 = \underline{\sigma}^2$ and $M = \underline{M}$, this reduces
to $\sigma^2/2M$. The variance of an unbridged LIS estimate will be $\sigma^2/M$. However, the bridged estimate
requires time proportional to $M + \underline{M}$, compared to just $M$ for the unbridged estimate. The value of
$M$ for the unbridged method can therefore be twice as large as for the bridged method, with the result
that bridged and unbridged estimates perform equally well asymptotically (assuming the variance of
$U$ is finite).

For the optimal bridge, the derivatives of the summands in the numerator and denominator are
both 1/4, when evaluated at $\hat r^{(i)}_{\text{LIS}} = r = 1$ and $\underline{\hat r}^{(i)}_{\text{LIS}} = 1/r = 1$, and assuming that $M = \underline{M}$. The
numerator and denominator both evaluate to 1/2, with the result that asymptotically the variance of
the bridged estimate, assuming $\sigma^2 = \underline{\sigma}^2$, is $\sigma^2/2M$, the same as for the geometric bridge.

In conclusion, bridged AIS and LIS estimates asymptotically have the same performance as the
corresponding unbridged estimates (with twice the value of $M$), for both the optimal and geometric
bridges, assuming $U$ has finite variance. This conclusion applies more generally, as long as a Central
Limit Theorem holds for the individual estimates, $\hat r^{(i)}_{\text{LIS}}$ and $\underline{\hat r}^{(i)}_{\text{LIS}}$. However, the bridged methods may
be much better when the variance of $U$ is infinite, or for classes of distributions other than that of
equation (33). The bridged methods may also provide improvement when the values of $n$ or $m$ are
not large enough for the asymptotic results to apply.
3.2 Properties of AIS and LIS when sampling from uniform distributions

In this section, I will demonstrate that when n is kept suitably small, LIS can perform much better 
than AIS when these methods are applied to sequences of uniform distributions. 

As a first example, consider the class of nested uniform distributions with unnormalized densities
given by

$$
p_\eta(x) = \begin{cases} 1 & \text{if } -s^\eta < x < s^\eta \\ 0 & \text{otherwise} \end{cases}
\tag{41}
$$

for which the normalizing constants are $Z_\eta = 2s^\eta$, so that $r = Z_1/Z_0 = s$. The results concerning
this class of distributions can easily be extended to any class of uniform distributions, in any number
of dimensions, that have nested regions of support. For both AIS and LIS, I will assume that the
intermediate distributions are defined by $\eta_j = j/n$. With this choice, the probability that a point, $x$,
randomly sampled from $\pi_{\eta_j}$ will have $p_{\eta_{j+1}}(x) = 1$ is $s^{1/n}$, for any $j$.

During an AIS run, only a single point is sampled from each distribution. An AIS run will produce
an estimate for $r$ of zero if any of the ratios $p_{\eta_{j+1}}(x^{(i)}_j) / p_{\eta_j}(x^{(i)}_j)$ in the AIS estimate are zero, which
happens with probability $1 - (s^{1/n})^n = 1 - s$, and will otherwise produce an estimate of one. Note
that the distribution of estimates is independent of $n$. AIS is therefore not a useful technique for
nested uniform distributions: simple importance sampling (ie, AIS with $n = 1$) would work just as
well (or just as poorly, if $s$ is very small). Bridged AIS produces no improvement in this context.

Suppose instead we use LIS with all $K_j = m$, and suppose that the Markov transitions, $T_j$, produce
points that are almost independent of the previous point. For this problem, both the geometric and
optimal forms of the bridge distribution result in $p_{j*j+1}(x) = p_{\eta_{j+1}}(x)$. If $m + 1$ points are sampled
independently from $\pi_{\eta_j}$, the fraction of these points for which $p_{\eta_{j+1}}(x)$ is one will have variance
$s^{1/n} (1 - s^{1/n}) / (m+1)$. For sufficiently large $m$, the variance of the log of this fraction will be approximately
$(s^{1/n} (1 - s^{1/n}) / (m+1)) / s^{2/n}$, which simplifies to $(s^{-1/n} - 1) / (m+1)$. For this approximation to be
useful, the probability that none of the $m+1$ points sampled from $\pi_{\eta_j}$ lie in the region where $p_{\eta_{j+1}}$ is
one, equal to $(1 - s^{1/n})^{m+1}$, must be negligible. This probability must be fairly small anyway, if LIS
is to perform well.

Suppose that the computational cost of an LIS run is proportional to the sum of the number of
points sampled from $\pi_0$ and the number of Markov transitions performed. If we fix this cost, the
number of intermediate distributions, $n$, and the number of transitions for each distribution, $m$, will
be related by $m(n+1) = C$, for some constant $C$. Assume for the moment that both $n$ and $m$
are large. The probability of a run producing a zero estimate will then be negligible, and we can
assess the accuracy of the estimate for one run by the variance of $\log \hat r^{(i)}_{\text{LIS}}$ (modified in some way to
eliminate the infinity resulting from the negligible, but non-zero, probability that $\hat r^{(i)}_{\text{LIS}}$ is zero). Looking
at equation (36), we see that for these nested uniform distributions, the second log term vanishes, since
$p_{j*j+1}(x_{j+1,k}) / p_{\eta_{j+1}}(x_{j+1,k})$ is always one, $p_{j*j+1}$ being the same as $p_{\eta_{j+1}}$. When $m$ is large, the
dependence between terms with different values of $j$ will be negligible, so we can add the variances of
the terms to get the variance of the estimate, obtaining the result that

$$
\mathrm{Var}\big( \log \hat r^{(i)}_{\text{LIS}} \big) \approx n\, (s^{-1/n} - 1) \,/\, (m+1)
\tag{42}
$$



When $n$ is large, $s^{-1/n} = \exp(\log(1/s)/n)$ is approximately $1 + \log(1/s)/n$, and hence the variance
above is approximately $\log(1/s) / (m+1)$. So it seems that the larger the value of $m$, the better,
until we reach a value of $m$ for which the corresponding value of $n$, equal to $C/m - 1$, is small enough
that this result no longer applies.

Best performance will therefore come using a fairly small value of $n$, but a large value of $m$.
Substituting $m = C/(n+1)$ into equation (42), and assuming $m/(m+1) \approx 1$, we get

$$
\mathrm{Var}\big( \log \hat r^{(i)}_{\text{LIS}} \big) \approx n\, (s^{-1/n} - 1) \,/\, (C/(n+1)) = n(n+1)\, (s^{-1/n} - 1) \,/\, C
\tag{43}
$$

The value of $n$ that minimizes this depends only on $s$, not on $C$. The optimal choice of $n$ increases
slowly as $s$ gets smaller: $s = 0.1$ gives $n = 2$, $s = 0.05$ gives $n = 3$, $s = 0.01$ gives $n = 4$, and
$s = 0.0001$ gives $n = 7$.
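
The optimal values of $n$ quoted above can be checked with a few lines of Python. The sketch below simply minimizes $n(n+1)(s^{-1/n} - 1)$ over integer $n$; the function names and the search limit `n_max` are assumptions made for illustration.

```python
def nested_uniform_variance_factor(n, s):
    """n (n+1) (s**(-1/n) - 1): the approximate variance of the log LIS
    estimate for nested uniform distributions, multiplied by the budget C."""
    return n * (n + 1) * (s ** (-1.0 / n) - 1.0)

def best_n(s, n_max=100):
    """Integer n in 1..n_max that minimizes the variance factor."""
    return min(range(1, n_max + 1),
               key=lambda n: nested_uniform_variance_factor(n, s))

if __name__ == "__main__":
    for s in (0.1, 0.05, 0.01, 0.0001):
        print(s, best_n(s))
```

Because $C$ enters only as a divisor, it drops out of the minimization, which is why the optimal $n$ depends on $s$ alone.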

As a second example, consider the class of non-nested uniform distributions with unnormalized
densities given by

$$
p_\eta(x) = \begin{cases} 1 & \text{if } \eta t - 1 < x < \eta t + 1 \\ 0 & \text{otherwise} \end{cases}
\tag{44}
$$

For this class, $Z_\eta = 2$ for all $\eta$, so $r = Z_1/Z_0 = 1$. I will again assume that the intermediate
distributions are defined by $\eta_j = j/n$, and that all $K_j = m$. Assuming that $n$ is greater than $t/2$, the
probability that a point, $x$, randomly sampled from $\pi_{\eta_j}$ will have $p_{\eta_{j+1}}(x) = 1$ is $1 - t/2n$, for any $j$.

For this example, AIS estimates do not converge to the true value of $r$ as $M$ increases, regardless
of the value of $n$. To see this, note that the ratios in the AIS estimate will all be either zero or one, and
that the estimate from one run, $\hat r^{(i)}_{\text{AIS}}$, will be one if all of these ratios are one, and zero otherwise. The
probability of a particular ratio being one is $1 - t/2n$, so the probability that all are one (assuming the
Markov transitions produce points independent of the current point) is $(1 - t/2n)^n$, which approaches
$\exp(-t/2)$ as $n$ goes to infinity. The AIS estimate, averaging over $M$ runs, will have mean close to
$\exp(-t/2)$, rather than the correct value of one.
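
The limiting value can be checked numerically. The short Python sketch below evaluates $(1 - t/2n)^n$ under the same independence assumption; the function name is an illustrative assumption.

```python
import math

def ais_nonzero_probability(t, n):
    """Probability that all n importance ratios equal one, assuming each
    Markov transition yields a point independent of the current one."""
    if n <= t / 2.0:
        return 0.0   # successive distributions do not overlap
    return (1.0 - t / (2.0 * n)) ** n

if __name__ == "__main__":
    t = 4.0
    for n in (3, 10, 250, 10**6):
        print(n, ais_nonzero_probability(t, n), math.exp(-t / 2.0))
```

Since each run's estimate is either zero or one, this probability is also the mean of the AIS estimate, so it stays below one however large $n$ is made.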

In contrast, bridged AIS estimates will converge to the true value as $M$ increases, as long as $n$ is at
least $t/2$, so that there is overlap between successive distributions in the sequence. However, when $t$
is large, the overlap between the distributions over paths produced by forward and reverse AIS runs,
given by $\exp(-t/2)$, will be very small, and the procedure will be very inefficient.

To see how well LIS performs, recall the formula for $\log \hat r^{(i)}_{\text{LIS}}$ from equation (36):

$$
\log \hat r^{(i)}_{\text{LIS}} = \sum_{j=0}^{n-1} \left[
\log\left( \frac{1}{K_j+1} \sum_{k=0}^{K_j} \frac{p_{j*j+1}(x_{j,k})}{p_{\eta_j}(x_{j,k})} \right)
- \log\left( \frac{1}{K_{j+1}+1} \sum_{k=0}^{K_{j+1}} \frac{p_{j*j+1}(x_{j+1,k})}{p_{\eta_{j+1}}(x_{j+1,k})} \right)
\right]
\tag{45}
$$

Due to symmetry, the two log terms above have the same distribution, for all $j$. The variance of
one of these log terms (for large $m$) is $((t/2n)(1 - t/2n)/(m+1)) / (1 - t/2n)^2$, which simplifies to
$1 / ((2n/t - 1)(m+1))$. The second log term in equation (36) for one $j$ will involve the same points, $x_{j+1,k}$,
as the first log term for the next $j$. The effect of this is that these terms will be negatively correlated,
with correlation of $-1$ if $n = t$. However, since the two terms occur with opposite signs, the effect on the
final sum is that $n-1$ pairs of terms (out of $2n$ terms total) are positively correlated. Straightforward
calculations show that this correlation is $2n/t - 1$ for $t/2 < n \le t$ and $1 / (2n/t - 1)$ for $n > t$. Using
the fact that when $X$ and $Y$ have the same distribution, $\mathrm{Var}(X + Y) = 2\, \mathrm{Var}(X)\, [1 + \mathrm{Cor}(X, Y)]$, we
obtain the result that, for large $m$,

$$
\mathrm{Var}\big( \log \hat r^{(i)}_{\text{LIS}} \big) \approx \frac{2}{(2n/t - 1)(m+1)}
\begin{cases}
n + (n-1)(2n/t - 1) & \text{if } t/2 < n \le t \\
n + (n-1)/(2n/t - 1) & \text{if } n > t
\end{cases}
\tag{46}
$$

Setting $m = C/(n+1)$, and assuming $m/(m+1) \approx 1$, gives

$$
\mathrm{Var}\big( \log \hat r^{(i)}_{\text{LIS}} \big) \approx \frac{2(n+1)}{C\,(2n/t - 1)}
\begin{cases}
n + (n-1)(2n/t - 1) & \text{if } t/2 < n \le t \\
n + (n-1)/(2n/t - 1) & \text{if } n > t
\end{cases}
\tag{47}
$$

Numerical investigation shows that the global minimum of the variance occurs where $n$ is near $(3/2)\, t$.
A second local minimum where $n$ is near $(3/4)\, t$ also exists. The two minima are nearly equally good
when $t$ is large. There is a local maximum where $n$ is near $t$, with the variance there being about
19% greater than at the global minimum. The variance is much larger for very large and very small
values of $n$. We therefore see that for this example too, the best results are obtained by fixing $n$ to a
moderate value; any desired level of accuracy can then be obtained by increasing $m$ and/or $M$.
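
This kind of numerical investigation is easy to repeat. The Python sketch below evaluates the variance approximation for the non-nested uniform example, multiplied by the budget $C$, with pairwise correlation $2n/t - 1$ for $n \le t$ and $1/(2n/t - 1)$ for $n > t$, and searches over integer $n$. The function names and the search ranges are assumptions made for illustration.

```python
def nonnested_variance_factor(n, t):
    """C times the approximate variance of the log LIS estimate for the
    shifted (non-nested) uniform example; valid only for n > t/2."""
    if n <= t / 2.0:
        raise ValueError("need n > t/2 so that successive distributions overlap")
    a = 2.0 * n / t - 1.0
    corr = a if n <= t else 1.0 / a      # correlation between adjacent log terms
    return 2.0 * (n + 1) / a * (n + (n - 1) * corr)

def argmin_n(t, lo, hi):
    """Integer n in [lo, hi] with the smallest variance factor."""
    return min(range(lo, hi + 1), key=lambda n: nonnested_variance_factor(n, t))

if __name__ == "__main__":
    t = 40
    print(argmin_n(t, t + 1, 5 * t))       # minimum for n > t
    print(argmin_n(t, t // 2 + 1, t))      # minimum for t/2 < n <= t
    print(nonnested_variance_factor(t, t) / nonnested_variance_factor(60, t))
```

With $t = 40$, the two searches return values close to $(3/2)\,t$ and $(3/4)\,t$, and the ratio of the variance factor at $n = t$ to that at $n = 60$ is about 1.19.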



4 Empirical comparisons of AIS and LIS 

The analytical results of the previous section indicate that LIS can sometimes perform much better 
than AIS, but that the benefits of LIS may only be seen when the number of intermediate distributions 
used is kept suitably small (but not so small that they do not overlap). In this section, I investigate 
the performance of AIS and LIS (and their bridged versions) empirically. The programs used for these 
tests (written in R) are available from my web page. 

These tests were done using sequences of one-dimensional distributions having unnormalized density
functions of the following form:

$$
p_\eta(x) = \exp\left( -\left| (x - \eta t) / s^\eta \right|^q \right)
\tag{48}
$$

where $s$, $t$, and $q$ are fixed constants. As $\eta$ moves from 0 to 1, the centre of this distribution shifts
by $t$ and changes width by the factor $s$. The power $q$ controls how thick the tails of the distributions
are. When $q = 2$, the distributions are Gaussian; a larger value produces lighter tails. Note that $Z_\eta$
is proportional to $s^\eta$, and hence $r = Z_1/Z_0$ is equal to $s$.
Figure 2: The sequences of unnormalized density functions used for the tests. The plots show the
unnormalized density functions for $\eta$ = 0, 1/4, 2/4, 3/4, 1, for six combinations of $s$, $t$, and $q$.



If $t = 0$, the distributions can be written in the form of equation (33), after reparameterizing in
terms of $\eta' = s^{-q\eta}$, so that $p_{\eta'}(x) = \exp(-\eta' |x|^q)$. In this case, we expect the asymptotic behaviour
to be as discussed in Section 3.1, but the behaviour with samples of practical size may be different.
As $q$ goes to infinity, the distributions converge to uniform distributions over $(\eta t - s^\eta,\ \eta t + s^\eta)$, and the
results of Section 3.2 become relevant.

I did an initial set of tests using six sequences of distributions. Three of these sequences were of
Gaussian distributions, with $q = 2$. The first of these used $s = 1$ and $t = 4$, producing a shift with no
change in scale as $\eta$ increases from 0 to 1. The second used $s = 0.05$ and $t = 0$, producing a contraction
with no shift. The last used $s = 0.3$ and $t = 2$, combining a shift with a contraction. A second set of
three sequences used the same values of $s$ and $t$, but with $q = 10$, which produces more 'rectangular'
distributions with lighter tails. The six sequences are shown in Figure 2. Each sequence in these plots
consists of five distributions, corresponding to $\eta$ = 0, 1/4, 2/4, 3/4, 1. These were the sequences used
for the LIS runs (hence $n = 4$ for these runs). The AIS runs used more distributions, spaced more finely
with respect to $\eta$, so as to produce the same number of Markov transitions and sampling operations
as in the LIS runs.

These distributions (for any $\eta$) can easily be sampled from using rejection sampling. Samples from
$\pi_0$ and $\pi_1$ were used to initialize forward and reverse runs of AIS and LIS. For this test, we pretend
that sampling for other $\pi_\eta$ must be done using Markov chain methods. The transition used for $\pi_\eta$, $T_\eta$,
was a random-walk Metropolis update, using a Gaussian proposal distribution with mean equal to the
current point and standard deviation $s^\eta$. Since Metropolis updates are reversible, the reverse transition, $\underline{T}_\eta$, was the same.
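
A random-walk Metropolis update of this kind might look as follows in Python. This is only a sketch, assuming densities of the form $\exp(-|(x - \eta t)/s^\eta|^q)$; it is not the R code used for the tests, and the function names and argument order are assumptions made for illustration.

```python
import math
import random

def log_density(x, eta, s, t, q):
    """Log of the unnormalized density exp(-|(x - eta*t) / s**eta|**q)."""
    return -abs((x - eta * t) / s ** eta) ** q

def metropolis_update(x, eta, s, t, q, rng):
    """One random-walk Metropolis update with a Gaussian proposal centred
    at x with standard deviation s**eta, leaving the density for eta invariant."""
    proposal = x + rng.gauss(0.0, s ** eta)
    log_accept = log_density(proposal, eta, s, t, q) - log_density(x, eta, s, t, q)
    if log_accept >= 0 or rng.random() < math.exp(log_accept):
        return proposal
    return x

if __name__ == "__main__":
    rng = random.Random(1)
    x = 0.0
    for _ in range(10):
        x = metropolis_update(x, 0.5, 0.3, 2.0, 10, rng)
        print(x)
```

Working with the log density avoids underflow when $q$ is large, since the density itself falls off very sharply outside the central region.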

Two sets of forward and reverse LIS runs were done with $n = 4$, all $K_j = 50$, and $M = 20$, one
set using the geometric bridge, the other using the optimal bridge with the true value of $r$. The
forward estimates were computed from the forward LIS estimate given earlier; the reverse estimates from equation (22), which
is equivalent to using the forward procedure with the reverse sequence of distributions. Bridged LIS
estimates were also found using equation (27), with the value of $r$ found by iteration. To make the
comparison with forward and reverse estimates fair, the bridged LIS estimates used $M = 10$ (ie, only
half of the forward and half of the reverse runs were used, for a total of 20 runs).

A corresponding set of forward, reverse, and bridged AIS runs were also done, with $n = 250$ and
$M = 20$ ($M = 10$ for the bridged estimates). If sampling a point from $\pi_0$ or $\pi_1$ takes about the same
computation time as a Metropolis update, these AIS runs will take about the same time as the LIS
runs. (This assumes that sampling and Markov transitions dominate the time, which is typically true
for real problems but perhaps not for this simple test problem.)

Sets of longer LIS and AIS runs were also done, which were the same as the sets above except that 
for LIS, Kj = 200 for all j, and for AIS, n = 1000, which again equalizes the computation time. 

Experience, together with the asymptotic results of Section 3.1, shows that estimates produced
using a small value of $M$ are better than, or at least as good as, those produced with larger $M$. I
chose $M = 20$ ($M = 10$ for bridged estimates) since this is about the smallest value that allows reliable
estimation of standard errors, which would usually be needed in practice.

The standard errors for AIS and LIS estimates of $r$ were estimated by the sample standard deviation
of the $\hat r^{(i)}$ divided by $\sqrt{M}$. When comparing the methods, I looked primarily at the mean squared
error when estimating $\log(r)$ (rather than when estimating $r$). The estimate I used was $\log(\hat r)$, and the
standard error for this estimate was estimated by the standard error for $\hat r$ divided by $\hat r$. For the reverse
runs, $\log(r)$ was estimated by $-\log(\underline{\hat r})$. For bridged AIS and LIS, the standard errors for the log of
the numerator and the log of the denominator of equation (27) were found, and the overall standard
error was computed as the square root of the sum of the squares of these two standard errors. This
method of converting estimates and standard errors for $r$ to those for $\log(r)$ is valid asymptotically.
It might be improved upon for finite samples, but such improvements would probably not affect the
relative merits of the methods compared here.
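
The conversion described here can be sketched in Python as follows. The function names are assumptions made for illustration; the arithmetic follows the description above (sample standard deviation divided by $\sqrt{M}$, then division by the estimate to obtain the standard error on the log scale).

```python
import math

def log_estimate_with_se(run_estimates):
    """Return (log of the averaged estimate, its approximate standard error)
    from M single-run estimates of r, using SE(log r) ~= SE(r) / r."""
    m = len(run_estimates)
    mean = sum(run_estimates) / m
    var = sum((e - mean) ** 2 for e in run_estimates) / (m - 1)   # sample variance
    se = math.sqrt(var / m)
    return math.log(mean), se / mean

def bridged_log_se(se_log_numerator, se_log_denominator):
    """Overall standard error for a bridged estimate on the log scale:
    square root of the sum of squares of the two standard errors."""
    return math.hypot(se_log_numerator, se_log_denominator)

if __name__ == "__main__":
    log_r, se_log_r = log_estimate_with_se([1.0, 2.0, 3.0])
    print(log_r, se_log_r)
```

Dividing by the estimate is a first-order (delta method) approximation, so it is only reliable when the standard error is small relative to the estimate.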

Figures 3 through 8 plot the mean squared errors of estimates for $\log(r)$ for the six sets of runs.
Results are shown for AIS, for LIS using the geometric bridge, and for LIS using the optimal bridge,
with the true value of $r$. Results for both the forward and reverse versions of each method are shown,
together with the bridged version, using the optimal bridge, with $r$ obtained by iteration. Results
for the short runs ($n = 4$, $K_j = 50$ for LIS, $n = 250$ for AIS) are on the left, and for the long runs
($n = 4$, $K_j = 200$ for LIS, $n = 1000$ for AIS) on the right. The mean squared error for each method was
estimated by simulating each method 2000 times, and comparing the estimates with the true value of
$\log(r)$. The bars in the plots are dark up to the estimated mean squared error minus twice its standard
error, and are then light up to the estimated mean squared error plus twice its standard error. For
bars that extend above the plot the estimated mean squared error is shown at the top of the bar.

[Figures 3 to 8: bar plots not reproduced. Each figure has a 'Short Runs' panel (AIS: n=250; LIS: n=4, m=50)
and a 'Long Runs' panel (AIS: n=1000; LIS: n=4, m=200), with forward, reverse, and bridged bars for AIS,
LIS-geometric, and LIS-optimal.]

Figure 3: Results of short and long runs on the distribution sequence with s = 1, t = 4, and q = 2.

Figure 4: Results of short and long runs on the distribution sequence with s = 1, t = 4, and q = 10.

Figure 5: Results of short and long runs on the distribution sequence with s = 0.05, t = 0, and q = 2.

Figure 6: Results of short and long runs on the distribution sequence with s = 0.05, t = 0, and q = 10.

Figure 7: Results of short and long runs on the distribution sequence with s = 0.3, t = 2, and q = 2.

Figure 8: Results of short and long runs on the distribution sequence with s = 0.3, t = 2, and q = 10.

The results for translated sequences of distributions ($t = 4$ and $s = 1$) are shown in Figures 3 and 4.
When the distributions are Gaussian ($q = 2$), no advantage is seen for LIS; if anything, LIS performs
slightly worse than AIS, particularly when the geometric bridge is used. The forward and reverse forms
of AIS and LIS should have identical performance for these distribution sequences, due to symmetry;
any differences seen result from random variation. The bridged forms of both AIS and LIS perform
better than the unbridged forward and reverse forms. The advantage of bridging is less for the longer
runs, however, as expected from the analysis at the end of Section 3.1.

When $q = 10$, the distributions have much lighter tails than the Gaussian, more closely resembling
the uniform distributions analysed in Section 3.2. For these sequences of distributions, LIS performs
substantially better than AIS. The unbridged version of AIS does particularly badly. The mean
squared error for the bridged version of AIS is about 2.5 times greater than for the bridged version of
LIS. It makes little difference whether the geometric or optimal bridge is used for LIS.

Figures 5 and 6 show the results for sequences of distributions with the same mean ($t = 0$) but
decreasing width ($s = 0.05$). For these sequences, a modest advantage of LIS over AIS is apparent
for the sequence of Gaussian distributions ($q = 2$), with the variance for AIS estimates being about a
factor of 1.3 greater than for LIS estimates with the geometric bridge, and about a factor of 1.7 greater
than for LIS estimates with the optimal bridge. The reversed AIS and LIS estimates are somewhat
worse than the forward estimates for this sequence of distributions. No advantage is seen for bridged
AIS or LIS estimates.

The results for the sequence of distributions with $q = 10$ are similar, except that the advantage of LIS
over AIS is much greater, about a factor of 6.

Results for the last type of sequence, with $s = 0.3$ and $t = 2$, are shown in Figures 7 and 8. This
problem is a hybrid of the previous two, with both translation and change in width, producing results
intermediate between those for the previous two problems. No difference in performance between AIS
and LIS is apparent for the Gaussian distributions ($q = 2$), but the bridged forms of both perform
slightly better. For the sequence of distributions with $q = 10$, a clear advantage of LIS over AIS can
be seen, but this advantage is not as great as for the sequence with $t = 0$ and $s = 0.05$. The bridged
forms of both AIS and LIS are again better, more so for the short runs than for the long runs.

In addition to looking at the mean squared error of estimates found with these methods, I also
looked at the fraction of times that the estimate for $\log(r)$ differed from the true value by more
than twice the standard error estimated using the $M$ runs. This should be approximately 5% if the
distribution of estimates is Gaussian, and the standard errors are accurate. For the longer runs, this
fraction was indeed near or only slightly above 5% for all methods, except for the unbridged AIS runs
when these performed very poorly. For the shorter runs, however, the unbridged AIS and LIS methods
produced estimates more than two standard errors from the true value around 10% of the time (sometimes
much more often, when unbridged AIS performed poorly). Both the bridged AIS and the bridged LIS
methods gave more reliable standard errors. However, it is possible that better standard errors for
the unbridged methods might be obtained with a more sophisticated approach than I used.

I performed additional runs to verify and extend some of the analytic results from Section 3.
Figures 9 and 10 show results obtained using LIS with increasing numbers of intermediate distributions,
starting with the value of $n = 4$ used for the tests above, and continuing to $n = 9$, $n = 19$, and
$n = 39$, while keeping the computation time constant by decreasing $m$ in proportion to $n+1$. The two
distribution sequences with $s = 1$ and $t = 4$ and with $s = 0.05$ and $t = 0$ were used, in both cases with
$q = 10$. The sequence with $t = 0$ and $s = 0.05$ has the form of equation (33), so in accordance with the
analysis of Section 3.1 we expect that asymptotically, as $n$ increases, LIS and AIS should have the
same performance. This is indeed what we see in the results for that sequence. We also see the same
behaviour for the sequence with $t = 4$ and $s = 1$.

As $q$ increases, the distributions become close to uniform, and the results of Section 3.2 should
apply. To test this, I tried values of $q = 2$, $q = 10$, $q = 20$, and $q = 30$ for the distribution sequence with
$s = 1$ and $t = 4$ and the sequence with $s = 0.05$ and $t = 0$. Results are shown in Figures 11 and 12. (The
results for $q = 2$ and $q = 10$ are the same as on the left in Figures 3 to 6, though the scale differs.)

For the sequences with s = 1 and t = 4, the limiting uniform distributions have the form of the 
second example in Section 3.2. As noted there, AIS estimates do not converge to the correct value of 
r for this distribution sequence; bridged AIS estimates do converge, but may be rather inefficient. We 
see analogous behaviour in Figure 11 when q is large. The mean squared error of the AIS estimates 
increases approximately linearly with q over the range q = 10 to q = 30. The bridged AIS estimates also 
get worse as q increases, but more slowly. In contrast, the mean squared error of the LIS estimates 
changes hardly at all as q increases. 

The story is similar for sequences with s = 0.05 and t = 0, for which the limiting uniform distributions 
correspond to those in the first example of Section 3.2. The LIS estimates perform about equally well 
for all values of q, but the AIS estimates are dramatically worse for large values of q. For this sequence, 
reverse AIS estimates are much worse than forward AIS estimates, and bridging does not help. 

According to the analysis of Section 3.1, the choice of n = 4 for LIS used above is not 
optimal for either of these distribution sequences when q is large. For the sequence with s = 1 and 
t = 4, using n = 6 should be better by a factor of 1.176. However, in LIS runs with q = 30, the mean 
squared error using n = 4 and m = 200 is indistinguishable from that using n = 6 and m = 143, 
given the standard errors (a factor of 1.09 or more should have been detectable). Of course, q = 30 
does not give exactly uniform distributions, and these values of m may not be large enough for the 
asymptotic results to apply, especially since the Markov transitions do not sample independently. For 
the sequence with s = 0.05 and t = 0, the results in Section 3.1 indicate that using n = 3 should be 
better by a factor of 1.084. In this case, LIS runs with q = 30 using n = 3 and m = 250 are better than 
runs using n = 4 and m = 200 by a factor of 1.16, significantly greater than one given the standard 
errors, but not significantly different from the expected ratio of 1.084. 
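
The pairs of n and m quoted in this section are consistent with a fixed budget of roughly (n+1) times m transitions per run. The sketch below reproduces them under that reading. The cost model and the function name `chain_length` are assumptions inferred from the quoted values, not a statement of how the runs were actually configured.

```python
def chain_length(budget, n):
    """Value of m giving about `budget` transitions when n+1 distributions are used.

    Assumes the cost of a run is proportional to (n + 1) * m.
    """
    return round(budget / (n + 1))

# Budget implied by n = 4 and m = 200 in the text above.
budget = (4 + 1) * 200
for n in (3, 4, 6, 9, 19, 39):
    print(n, chain_length(budget, n))
```

Under this assumption, n = 3 pairs with m = 250 and n = 6 with m = 143, matching the runs compared above.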

Figure 9: Results using increasing values of n for LIS, while keeping computation time constant, for 
the distribution sequence with s = 1, t = 4, and q = 10. The same AIS procedure was used for all plots, 
but results vary randomly. 

Figure 10: Results using increasing values of n for LIS, while keeping computation time constant, for 
the distribution sequence with s = 0.05, t = 0, and q = 10. The same AIS procedure was used for all 
plots, but results vary randomly. 

Figure 11: Results with increasing values of q, for sequences of distributions with s = 1 and t = 4. The 
AIS runs used n = 250; the LIS runs used n = 4 and m = 50, requiring the same amount of computation. 

Figure 12: Results with increasing values of q, for sequences of distributions with s = 0.05 and t = 0. 
The AIS runs used n = 250; the LIS runs used n = 4 and m = 50, requiring the same amount of 
computation. 

5 Other applications of linked sampling 

So far in this paper, I have focused on how Linked Importance Sampling can be used to estimate ratios 
of normalizing constants. LIS can also be used to estimate expectations with respect to $\pi_1$, however, 
and in some applications, this may be its most important use. Linked sampling methods related to 
LIS can also be applied in other ways. I briefly describe these other applications here, outlining the 
use of linked sampling for 'dragging' fast variables in some detail. 

5.1 Estimating expectations 

The expectation of some function, $a(x)$, with respect to $\pi_1$ can be estimated using simple importance 
sampling, with points drawn from $\pi_0$, as follows: 

\[
E_{\pi_1}[a(X)] \;\approx\; \frac{1}{N}\sum_{i=1}^{N} a(x^{(i)})\,\frac{p_1(x^{(i)})}{p_0(x^{(i)})} \;\Bigg/\; \frac{1}{N}\sum_{i=1}^{N} \frac{p_1(x^{(i)})}{p_0(x^{(i)})} \tag{49}
\]

where $x^{(1)}, \ldots, x^{(N)}$ are drawn from $\pi_0$. Like equation 0, this estimate is valid only if no region 
having zero probability under $\pi_0$ has non-zero probability under $\pi_1$. The two factors of $1/N$ of course 
cancel, but are included to emphasize the connection with the estimate for $r = Z_1/Z_0$, which is simply 
the denominator of the estimate above. 
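
As a concrete illustration of this ratio estimate, the following sketch applies it to a pair of one-dimensional Gaussians. The choice of distributions, the sample size, and the function name `sis_estimates` are assumptions made for the example only.

```python
import math
import random

def sis_estimates(a, log_p0, log_p1, samples):
    """Return (estimate of E_pi1[a(X)], estimate of r = Z1/Z0) from draws from pi_0."""
    weights = [math.exp(log_p1(x) - log_p0(x)) for x in samples]
    n = len(samples)
    r_hat = sum(weights) / n                                   # denominator of the estimate
    numerator = sum(w * a(x) for w, x in zip(weights, samples)) / n
    return numerator / r_hat, r_hat

# Assumed toy problem: pi_0 = N(0, 1) and pi_1 = N(1, 0.5^2), both unnormalized.
log_p0 = lambda x: -0.5 * x * x
log_p1 = lambda x: -0.5 * ((x - 1.0) / 0.5) ** 2
rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(100000)]
mean_hat, r_hat = sis_estimates(lambda x: x, log_p0, log_p1, samples)
print(mean_hat, r_hat)
```

For this toy pair the true values are an expectation of 1 and r = 0.5, since the normalizing constants are proportional to the two standard deviations. The estimate behaves well here because $\pi_0$ covers the region where $\pi_1$ has its mass.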

Since LIS can be viewed as simple importance sampling on an extended state space, with distributions 
$\Pi_0$ and $\Pi_1$ defined by the forward and reverse procedures of Section 2, we can use equation (49) 
to estimate any quantity that can be expressed as an expectation with respect to $\Pi_1$. Step (1) of the 
reverse procedure defining $\Pi_1$ sets $x_{n,\mu_n}$ to a value randomly chosen from $\pi_{\eta_n} = \pi_1$. Step (2) then 
sets the other $x_{n,k}$ to values obtained from $x_{n,\mu_n}$ by applying Markov chain transitions that leave $\pi_1$ 
invariant. It follows that under $\Pi_1$, all the points $x_{n,k}$ have marginal distribution $\pi_1$ (though they 
may not be independent). Accordingly, 

\[
E_{\pi_1}[a(X)] \;=\; E_{\Pi_1}\!\left[\frac{1}{K_n+1}\sum_{k=0}^{K_n} a(X_{n,k})\right] \tag{50}
\]

Estimating the right side as in equation (49), and using the fact that the ratio of probabilities under 
$\Pi_1$ over those under $\Pi_0$ is given by $\hat{r}_{\mathrm{LIS}}$ in equation (10), we get the estimate 

\[
E_{\pi_1}[a(X)] \;\approx\; \sum_{i=1}^{M} \hat{r}_{\mathrm{LIS}}^{(i)} \left[\frac{1}{K_n+1}\sum_{k=0}^{K_n} a\big(x_{n,k}^{(i)}\big)\right] \;\Bigg/\; \sum_{i=1}^{M} \hat{r}_{\mathrm{LIS}}^{(i)} \tag{51}
\]

If the M runs of LIS are started by sampling independently from $\pi_0$ (as will often be possible), the 
standard error of this estimate can be assessed in the usual fashion for importance sampling, as I have 
discussed for the analogous AIS estimates in (Neal 2001). This error assessment can be difficult, since 
when some $\hat{r}_{\mathrm{LIS}}^{(i)}$ is much larger than others, the variance of the $\hat{r}_{\mathrm{LIS}}^{(i)}$ is hard to estimate. Note, however, 
that the degree to which the Markov chain transitions used have converged need not be assessed, a 
possible advantage compared with simple MCMC estimates. The estimate of equation (51) will be 
asymptotically correct (as $M \to \infty$) regardless of how far these Markov chain transitions are from 
convergence. 
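
The weighted average over runs can be sketched as follows, assuming each run supplies its ratio estimate together with the values of a(x) along its final chain. The function names and the small hand-made inputs are hypothetical. The effective sample size shown is a common importance sampling diagnostic added here for illustration and is not taken from this paper.

```python
def lis_expectation(r_hats, final_chain_values):
    """Weighted average over runs: each run contributes the mean of a(x) over its
    final chain, weighted by that run's estimate of the ratio of normalizing constants."""
    numerator = sum(r * sum(vals) / len(vals)
                    for r, vals in zip(r_hats, final_chain_values))
    return numerator / sum(r_hats)

def effective_sample_size(r_hats):
    """(sum of weights)^2 / (sum of squared weights); small when a few runs dominate."""
    return sum(r_hats) ** 2 / sum(r * r for r in r_hats)

# Hypothetical inputs: two runs, each with a final chain of two states.
r_hats = [1.0, 3.0]
final_chain_values = [[0.0, 2.0], [4.0, 4.0]]
estimate = lis_expectation(r_hats, final_chain_values)
ess = effective_sample_size(r_hats)
print(estimate, ess)
```

When one weight is much larger than the rest, the effective sample size approaches one, which is the situation in which the standard error is hard to assess.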

The primary reason one might wish to use LIS to estimate expectations is that going through 
the sequence of distributions parameterized by $\eta_0, \ldots, \eta_n$ may produce an 'annealing' effect, which 
prevents the Markov chain sampler from being trapped in a local mode of the distribution. Compared 
with the analogous AIS procedure, LIS may perform better for some forms of distributions, for the 
same reasons as were discussed in Sections 01 and 01. One should also note that LIS estimates for 
expectations with respect to $\pi_{\eta_j}$ for all j can easily be obtained from a single set of runs, by simply 
considering the results of each LIS run up to the point where the sample for $\pi_{\eta_j}$ is obtained. 

5.2 A linked form of tempered transitions 

My 'tempered transition' method (Neal 1996) is another approach to sampling from distributions 
with isolated modes, between which movement is difficult for Markov chain transitions such as simple 
Metropolis updates. In this approach, such simple Markov chain transitions are supplemented by 
occasional complex 'tempered transitions', composed of many simple Markov chain transitions. A 
tempered transition consists of several stages, which proceed through a sequence of distributions, from 
the distribution being sampled, to a 'higher temperature' distribution in which movement between 
modes is easier, and then back down to the distribution being sampled. At each stage of a tempered 
transition, we generate a single new state by applying a Markov chain transition to the current state, 
after which we switch to the next distribution in the sequence. The second half of a tempered transition 
is similar to an Annealed Importance Sampling run, while the first half is similar to an AIS run with 
the reversed sequence of distributions. 

A similar 'linked' procedure can be defined, in which at each stage we generate a chain of states by 
applying a Markov chain transition. We then select a 'link state' from this sequence (using a suitable 
bridge distribution) which serves as the starting point for the chain of states generated in the next 
stage. In the final stage, a chain of states is produced using a Markov chain transition that leaves the 
distribution being sampled invariant, and a candidate state is selected uniformly at random from this 
chain. The appropriate probability for accepting this candidate state is computed using ratios similar 
to those going into the LIS estimate of equation (10). 

As discussed in Section jH for AIS to work well, all distributions in the sequence must assign 
reasonably high probability to regions of the space that have non-negligible probability under the next 
distribution in the sequence. One would expect tempered transitions to work well only when this 
holds for both the sequence and its reversal. In contrast, one would expect the 'linked' version of 
tempered transitions to work well as long as the sequence satisfies the weaker condition that there be 
some 'overlap' between adjacent distributions (assuming a suitable bridge distribution is used). 

5.3 Dragging fast variables using linked chains 

A slight modification of the tempered transition method can be applied to problems in which the state 
is composed of both 'fast' and 'slow' variables. We will write the distribution of interest for such a 
problem as 

\[
\pi(x,y) = (1/Z)\,\exp(-U(x,y)) \tag{52}
\]

where x denotes the 'fast' variables and y the 'slow' variables. We assume that the computation 
is dominated by the time required to evaluate $U(x,y)$, but that once $U(x,y)$ has been evaluated, 
with relevant intermediate quantities saved, evaluating $U(x',y)$ for any new $x'$ is much faster than 
evaluating $U(x',y')$ for some $y'$ not previously encountered. One example of such a problem is inference 
for Gaussian process classification models (Neal 1999), in which y consists of the hyperparameters 
defining the covariance function used, and x consists of the latent variables associated with the n 
observations. After a change to y, we must recompute the Cholesky decomposition of an $n \times n$ 
covariance matrix, which takes time proportional to $n^3$, whereas after a change to x only, $U(x,y)$ can 
be re-computed in time proportional to $n^2$, assuming the Cholesky decomposition for this value of y 
has been saved. 
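
The fast/slow split amounts to caching whatever depends only on y. The sketch below shows the pattern with a counter of slow evaluations. The class name `CachedEnergy` and the toy setup and evaluation functions are hypothetical stand-ins. In the Gaussian process example the slow part would be the Cholesky decomposition, which is not implemented here.

```python
import math

class CachedEnergy:
    """Energy U(x, y) whose y-dependent setup is computed once per distinct y."""

    def __init__(self, slow_setup, fast_eval):
        self._slow_setup = slow_setup   # expensive, depends on y only
        self._fast_eval = fast_eval     # cheap, given the saved setup for y
        self._cache = {}
        self.slow_calls = 0

    def __call__(self, x, y):
        if y not in self._cache:        # y must be hashable, e.g. a float or tuple
            self._cache[y] = self._slow_setup(y)
            self.slow_calls += 1
        return self._fast_eval(x, self._cache[y])

# Toy stand-ins (assumed): the 'setup' is a scale factor that depends on y.
energy = CachedEnergy(slow_setup=lambda y: math.exp(-y),
                      fast_eval=lambda x, scale: 0.5 * scale * x * x)
energy(0.1, 0.0)
energy(0.2, 0.0)   # reuses the setup saved for y = 0.0
energy(0.3, 1.0)   # new y, so one more slow computation
print(energy.slow_calls)
```

Only two slow computations occur in the usage above, one for each distinct value of y, however many values of x are tried.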

In my method for 'dragging' fast variables (Neal 2004), the ability to quickly re-evaluate $U(x,y)$ 
when only x changes is exploited to allow larger changes to be made to y than would be possible 
if x were kept fixed, or were given a new value from some simple proposal distribution. From the 
state $(x_0,y_0)$, a dragging update proposes a new value $y_1$, drawn from some symmetrical proposal 
distribution, in conjunction with a new value $x_1$ that is found by applying a succession of Markov 
chain updates that leave invariant distributions in the series $\pi_{\eta_j}(x)$, for $j = 1, \ldots, n-1$, with 
$0 < \eta_j < \eta_{j+1} < 1$. The proposed state, $(x_1,y_1)$, is then accepted or rejected in a fashion analogous to 
tempered transitions. 

The distributions in the sequence used are defined by the following unnormalized probability or 
density function, which depends on the current and proposed values for y: 

\[
p_\eta(x) = \exp\big(-((1-\eta)\,U(x,y_0) + \eta\,U(x,y_1))\big) \tag{53}
\]

The corresponding normalized probability or density function will be written as $\pi_\eta$. Note that $\pi_0(x) = 
\pi(x|y_0)$ and $\pi_1(x) = \pi(x|y_1)$. Crucially, after $U(x,y_0)$ and $U(x,y_1)$ have been evaluated once (for 
any x), we can evaluate $p_\eta(x)$ for any $\eta$ and any x without any further 'slow' computations. Indeed, 
since $U(x_0,y_0)$ will usually have already been evaluated as part of the previous Markov chain transition, 
only one slow computation will be required to evaluate $p_\eta(x)$ for any number of values of $\eta$ and x. 
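
A small sketch of this point: once the two energies at a given x are in hand, the whole family of interpolated densities costs only arithmetic. The function name `log_p_eta` and the numerical values are assumptions chosen for illustration.

```python
def log_p_eta(eta, u0, u1):
    """Log of the unnormalized interpolated density, given u0 = U(x, y0) and u1 = U(x, y1)."""
    return -((1.0 - eta) * u0 + eta * u1)

# Hypothetical energies at one x, each evaluated once.
u0, u1 = 2.0, 5.0
etas = [0.0, 0.25, 0.5, 0.75, 1.0]
log_ps = [log_p_eta(eta, u0, u1) for eta in etas]
print(log_ps)
```

The endpoints recover the two conditional distributions: eta = 0 gives -U(x, y0) and eta = 1 gives -U(x, y1).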

A 'linked' dragging update can be defined as follows. Given the sequence of distributions defined 
by $\eta_0, \ldots, \eta_n$, with $\eta_0 = 0$ and $\eta_n = 1$, the numbers of transitions ($T$ or $\underline{T}$) to perform for each 
distribution over x, denoted by $K_0, \ldots, K_n$, and a set of bridge distributions, denoted by $p_{j*j+1}$ for 
$j = 0, \ldots, n-1$, an update from the current state $(x_0, y_0)$ is done as follows: 

The Linked Dragging Procedure 

1) Propose a new value, $y_1$, from some proposal distribution $S(y_1|y_0)$, which satisfies the symmetry 
   condition that $S(y_1|y_0) = S(y_0|y_1)$. 

2) Pick an integer $\nu_0$ uniformly at random from $\{0, \ldots, K_0\}$, and then set $x_{0,\nu_0}$ to the current values 
   of the fast variables, $x_0$. 

3) For $j = 0, \ldots, n$, create a chain of values for x associated with $\pi_{\eta_j}$ as follows: 

   a) If $j > 0$: Pick an integer $\nu_j$ uniformly at random from $\{0, \ldots, K_j\}$, and then set $x_{j,\nu_j}$ to 
      $x_{j-1*j}$. 

   b) For $k = \nu_j + 1, \ldots, K_j$, draw $x_{j,k}$ according to the forward Markov chain transition 
      probabilities $T_{\eta_j}(x_{j,k-1}, x_{j,k})$. (If $\nu_j = K_j$, do nothing in this step.) 

   c) For $k = \nu_j - 1, \ldots, 0$, draw $x_{j,k}$ according to the reverse Markov chain transition 
      probabilities $\underline{T}_{\eta_j}(x_{j,k+1}, x_{j,k})$. (If $\nu_j = 0$, do nothing in this step.) 

   d) If $j < n$: Pick a value for $\mu_j$ from $\{0, \ldots, K_j\}$ according to the following probabilities 

      \[
      \Pi_0(\mu_j \mid x_j) \;=\; \frac{p_{j*j+1}(x_{j,\mu_j})}{p_{\eta_j}(x_{j,\mu_j})} \;\Bigg/\; \sum_{k=0}^{K_j} \frac{p_{j*j+1}(x_{j,k})}{p_{\eta_j}(x_{j,k})} \tag{54}
      \]

      and then set $x_{j*j+1}$ to $x_{j,\mu_j}$. 

4) Set $\mu_n$ to a value chosen uniformly at random from $\{0, \ldots, K_n\}$, and let the proposed new values 
   for the fast variables, $x_1$, be equal to $x_{n,\mu_n}$. 

5) Accept $(x_1,y_1)$ as the new state with probability 

   \[
   \min\left[\,1,\; \prod_{j=0}^{n-1} \; \frac{\displaystyle \frac{1}{K_j+1} \sum_{k=0}^{K_j} \frac{p_{j*j+1}(x_{j,k})}{p_{\eta_j}(x_{j,k})}}{\displaystyle \frac{1}{K_{j+1}+1} \sum_{k=0}^{K_{j+1}} \frac{p_{j*j+1}(x_{j+1,k})}{p_{\eta_{j+1}}(x_{j+1,k})}} \,\right] \tag{55}
   \]

   If $(x_1,y_1)$ is not accepted, the new state is the same as the old state, $(x_0,y_0)$. 
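
The procedure above can be sketched in Python for a toy problem. Everything specific here is an assumption made for illustration: the energy U (chosen so that y is standard normal marginally), the random-walk Metropolis transitions (which are reversible, so the same update serves as both forward and reverse transition), the geometric bridge taken as the square root of the product of the two adjacent unnormalized densities, and all function names. The acceptance probability is assumed to be capped at one. No caching of slow computations is attempted.

```python
import math
import random

S = 0.5  # assumed standard deviation of x given y in the toy problem

def U(x, y):
    """Toy energy: y ~ N(0, 1) marginally, and x | y ~ N(y, S^2)."""
    return 0.5 * y * y + 0.5 * ((x - y) / S) ** 2

def log_p(eta, x, y0, y1):
    """Log of the unnormalized density interpolating between y0 and y1."""
    return -((1.0 - eta) * U(x, y0) + eta * U(x, y1))

def log_bridge_ratio(eta_other, eta_own, x, y0, y1):
    """Log of (geometric bridge between eta_own and eta_other) / p_eta_own at x."""
    return 0.5 * (log_p(eta_other, x, y0, y1) - log_p(eta_own, x, y0, y1))

def metropolis(x, eta, y0, y1, rng, step=0.5):
    """One reversible random-walk Metropolis update leaving pi_eta invariant."""
    prop = x + rng.gauss(0.0, step)
    log_accept = log_p(eta, prop, y0, y1) - log_p(eta, x, y0, y1)
    return prop if math.log(1.0 - rng.random()) < log_accept else x

def build_chain(x_link, K, eta, y0, y1, rng):
    """Place the link state at a random index, then extend forward and backward."""
    nu = rng.randrange(K + 1)
    chain = [x_link] * (K + 1)
    for k in range(nu + 1, K + 1):
        chain[k] = metropolis(chain[k - 1], eta, y0, y1, rng)
    for k in range(nu - 1, -1, -1):
        chain[k] = metropolis(chain[k + 1], eta, y0, y1, rng)
    return chain

def log_mean_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def linked_dragging_update(x0, y0, etas, Ks, rng, proposal_sd=1.0):
    y1 = y0 + rng.gauss(0.0, proposal_sd)            # symmetric proposal for y
    n = len(etas) - 1
    link, log_ratio = x0, 0.0
    for j in range(n + 1):
        chain = build_chain(link, Ks[j], etas[j], y0, y1, rng)
        if j > 0:                                    # denominator factor for pair (j-1, j)
            down = [log_bridge_ratio(etas[j - 1], etas[j], x, y0, y1) for x in chain]
            log_ratio -= log_mean_exp(down)
        if j < n:                                    # numerator factor, then pick the link state
            up = [log_bridge_ratio(etas[j + 1], etas[j], x, y0, y1) for x in chain]
            log_ratio += log_mean_exp(up)
            top = max(up)
            link = rng.choices(chain, weights=[math.exp(u - top) for u in up])[0]
    x1 = chain[rng.randrange(Ks[n] + 1)]             # candidate chosen uniformly from last chain
    if math.log(1.0 - rng.random()) < log_ratio:     # accept with probability min(1, ratio)
        return x1, y1
    return x0, y0

rng = random.Random(1)
x, y, ys = 0.0, 0.0, []
for _ in range(3000):
    x, y = linked_dragging_update(x, y, [0.0, 0.5, 1.0], [5, 5, 5], rng)
    ys.append(y)
mean_y = sum(ys) / len(ys)
var_y = sum((v - mean_y) ** 2 for v in ys) / len(ys)
print(mean_y, var_y)
```

If the update leaves the joint distribution invariant, the sampled values of y should have mean near zero and variance near one for this toy energy, which gives a rough empirical check of the sketch.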



One can show that this update leaves $\pi(x,y)$ invariant by showing that it satisfies detailed balance, 
which in turn follows from the stronger property that the probability of starting at $(x_0,y_0)$, assuming 
this start state comes from $\pi(x,y)$, then generating the various quantities produced by the above 
procedure, and finally accepting $(x_1,y_1)$ as the new state, is the same as the probability of starting 
this procedure at $(x_1,y_1)$, generating the same quantities in reverse, and finally accepting $(x_0,y_0)$. 
The proof of this is analogous to the derivation of LIS in Section [51 

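To use the linked dragging procedure, we need to select suitable bridge distributions. Since the 
characteristics of $\pi_\eta(x)$ will depend on $y_0$, $y_1$, and of course $\eta$, we may not know enough to select 
good estimates for the values of r needed to use the optimal bridge of equation ©, though we might 
try just setting r to one. This is not a problem for the geometric bridge, 
$p_{j*j+1}(x) = \sqrt{p_{\eta_j}(x)\,p_{\eta_{j+1}}(x)}$, for which the acceptance probability above can be written as 

\[
\min\left[\,1,\; \prod_{j=0}^{n-1} \; \frac{\displaystyle \frac{1}{K_j+1} \sum_{k=0}^{K_j} \sqrt{p_{\eta_{j+1}}(x_{j,k}) \,/\, p_{\eta_j}(x_{j,k})}}{\displaystyle \frac{1}{K_{j+1}+1} \sum_{k=0}^{K_{j+1}} \sqrt{p_{\eta_j}(x_{j+1,k}) \,/\, p_{\eta_{j+1}}(x_{j+1,k})}} \,\right] \tag{56}
\]

From equation (53), we see that 

\[
\sqrt{p_{\eta_{j+1}}(x_{j,k}) \,/\, p_{\eta_j}(x_{j,k})} \;=\; \exp\big(-(\eta_{j+1}-\eta_j)\,(U(x_{j,k},y_1) - U(x_{j,k},y_0))\,/\,2\big) \tag{57}
\]

\[
\sqrt{p_{\eta_j}(x_{j+1,k}) \,/\, p_{\eta_{j+1}}(x_{j+1,k})} \;=\; \exp\big(-(\eta_{j+1}-\eta_j)\,(U(x_{j+1,k},y_0) - U(x_{j+1,k},y_1))\,/\,2\big) \tag{58}
\]

For the simplest case with no intermediate distributions (ie, with n = 1), the acceptance probability 
simplifies to 

\[
\min\left[\,1,\; \frac{\displaystyle \frac{1}{K_0+1} \sum_{k=0}^{K_0} \exp\big(-(U(x_{0,k},y_1) - U(x_{0,k},y_0))\,/\,2\big)}{\displaystyle \frac{1}{K_1+1} \sum_{k=0}^{K_1} \exp\big(-(U(x_{1,k},y_0) - U(x_{1,k},y_1))\,/\,2\big)} \,\right]
\]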

6 Conclusions and Future work 

In this paper, I have demonstrated that in some situations Linked Importance Sampling is substantially 
more efficient than Annealed Importance Sampling, provided a suitable number of intermediate 
distributions are used. However, in other situations, where the tails of the distributions involved are 
sufficiently heavy, the two methods are about equally efficient. More research is therefore needed to 
determine for which problems of practical interest LIS, and related linked sampling methods, will be 
useful. 

In tests on multivariate Gaussian distributions, I have not seen an advantage for LIS over AIS. Both 
perform about equally well on a sequence of 100-dimensional spherical Gaussian distributions with 
variances changing by a factor of two, so that log(r) = -100. This is in accord with the results in 
Section [IJ where LIS had little or no advantage over AIS when the distributions were Gaussian. LIS 
is more likely to be useful for problems involving continuous distributions with lighter tails. 

One problem that may benefit from LIS is that of computing the probability of a very rare event, 
which can be cast as computing the normalizing constant for a distribution with the constraint that 
the state be in the set corresponding to this event. Intermediate distributions might use looser forms 
of this constraint. If, in all these distributions, states violating the constraints have zero probability, 
AIS will tend to have the same bad behaviour seen with uniform distributions in Section 3.2, while 
LIS may work much better. 
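
To make the idea of looser constraints concrete, the following sketch uses a case where every normalizing constant is known exactly: the event that a standard normal variable exceeds 5, with intermediate constraints at assumed thresholds 1 through 4. The thresholds and function name are hypothetical. In a real application each ratio would have to be estimated, for instance by LIS, rather than computed in closed form.

```python
import math

def gaussian_tail(c):
    """P(X > c) for a standard normal X; c = -inf gives probability one."""
    return 0.5 * math.erfc(c / math.sqrt(2.0))

# Normalizing constants Z_j for a sequence of progressively tighter constraints X > c_j.
thresholds = [-math.inf, 1.0, 2.0, 3.0, 4.0, 5.0]
z = [gaussian_tail(c) for c in thresholds]

# Ratios for adjacent distributions; their product telescopes to the rare-event probability.
ratios = [z[j + 1] / z[j] for j in range(len(z) - 1)]
product = math.prod(ratios)
print(ratios, product)
```

Each adjacent ratio is far larger than the final probability, which is what makes estimating them one at a time feasible. Note that all of these distributions assign zero probability to states violating their constraint, the situation in which the text above expects AIS to behave badly.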

Another context where LIS may outperform AIS is when only a fixed number of intermediate 
distributions are available (ie, only a finite number of values are allowed for $\eta$). This is the situation 
for the 'sequential importance sampler' of MacEachern, Clyde, and Liu (1999), which can be seen as 
an instance of AIS (Neal 2001). Here, the intermediate distributions use only a fraction of the n items 
in the data set; such a fraction can only have the form j/n with j an integer. The distance between 
successive distributions for this problem may sometimes be too great for AIS to work well, but their 
overlap might nevertheless be sufficient for LIS. 

It may be possible to improve LIS by reducing the variance in how well it samples at each stage. 
Instead of performing a predetermined number, Kj, of Markov transitions at stage j, we might instead 
perform as many transitions as are necessary to obtain a good sample. Define a 'tour' to be a 
sequence of transitions that moves from a high value of some key quantity (eg, $U(x)$ for the canonical 
distributions of equation (^)) to a low value of this quantity, or vice versa. Good sampling might 
be ensured by performing some predetermined number of tours, with the number of these tours that 
occur before and after the link state being chosen at random. Suitable 'high' and 'low' values would 
probably need to be found using preliminary runs. 

More speculatively, it seems as if there should be some method that has the advantages of LIS 
over AIS, but that like AIS uses many intermediate distributions, performing only a single Markov 
transition for each. Intuitively, it seems that such a 'smooth' method that does not abruptly change 
$\eta$ should be more efficient. One can use LIS with all $K_j$ set to one, but this will produce good results 
only if n is large, which we saw in the analysis of Section 3.1 does not lead to an advantage over 
AIS. Perhaps some way could be found of using states associated with all values of $\eta$ when estimating 
each of the ratios $Z_{\eta_{j+1}}/Z_{\eta_j}$, while still producing an estimate that is exactly unbiased even when the 
Markov transitions do not reach equilibrium. 